Potential SQL Issues

Lists the most common SQL mistakes made during statement creation that are identified as potential issues by Workload XM. The Health Check list, on the engine’s Queries page, categorizes the health tests. For example, for Hive, MapReduce, Oozie, and Spark engines, the Insufficient Partitioning and Many Materialized Columns health checks, test for query and schema issues.

Table 1. Common SQL Issues
Potential SQL Issue Impact Recommendation
>5 table joins or > 10 join conditions found.

Possible performance impact, depending on the size of a table, partitioning keys, and filter and join conditions that are specified in the query.

To address this issue, denormalize tables to eliminate the need for joins.

>10 columns present in GROUP BY list.
Possible performance impact, depending on the number of distinct groups and the memory configuration.

To address this issue, evaluate the memory requirements for the query.

>10 Inline Views present in query.

Possible performance impact, depending on the memory configuration, especially if complex expressions are present in inline views on Impala.

To address this issue, evaluate the memory requirements and materialize inline views.

>50 query blocks present in large query.

Possible performance impact, depending on the memory configuration.

To address this issue, evaluate the query memory requirements, split the query into smaller queries, and materialize duplicate blocks.

>2000 expressions found in WHERE clause of a single query.

This is a hard limit enforced by Impala. The query fails if it contains >2000 expressions.

To address this issue, consolidate expressions by replacing repetitive sequences with single operators like IN or BETWEEN.

Cartesian or CROSS join found.

Performance impact if tables are large.

To address this issue, rewrite the query by adding join conditions and eliminate Cartesian joins.

High cardinality GROUP BY column found.

Possible performance impact, depending on the number of distinct groups and the memory configuration.

To address this issue, evaluate the memory requirements for the query.

Joins across large tables found.

Possible performance impact, depending on the partitioning keys, and filter and join conditions that are specified in the query.

To determine the cause, evaluate the EXPLAIN output on Impala.

To address this issue, evaluate the filter and join conditions, the query's memory requirements, and consider table partitioning strategies.

Join on a large table found.

Possible performance impact, depending on the partitioning keys, and filter and join conditions that are specified in the query.

To determine the cause, evaluate the EXPLAIN output on Hive or Impala.

To address this issue, evaluate the filter and join conditions, the query's memory requirements, and consider table partitioning strategies.

Many single-row inserts found.

Possible performance impact when using singleton inserts that create multiple small files instead of less large files.

To address this issue, batch inserts together, which prevents the creation of multiple small data files.

Popular CASE expression across queries found.

Possible performance improvement.

Consider materializing the CASE expression.

Popular filter conditions found.

Possible performance impact if the tables are large and are not partitioned.

To address this issue, consider table partitioning strategies on the filter conditions.

Popular inline views across queries found.

Possible performance impact, depending on the memory configuration, especially if complex expressions are used in inline views on Impala.

To address this issue, consider materializing the inline view.

Popular subqueries across queries found.

Possible performance improvement.

Consider materializing the subqueries.

Query has no filters.

Possible performance impact, if the result set that is returned is very large.

To address this issue, rewrite the query by adding filtering conditions that reduce the size of the result set that is returned.

Query on partitioned table is missing filters on partitioning columns.

Possible performance impact if the tables are large.

To address this issue, rewrite the query by adding filtering conditions.

Query with filter conditions on a large table found.

Possible performance impact if the tables are large and are not partitioned.

To address this issue, consider table partitioning strategies on the filter conditions.

Query with inline views found.

Possible performance impact, depending on the memory configuration, especially if complex expressions are used in inline views on Impala.

To address this issue, if the inline view is duplicated, evaluate whether materializing the inline view is advantageous.

Table might contain too many partitions (>30K).

May crash the Hive Metastore.

To address this issue, re-evaluate the partitioning key strategy, as queries that access multiple partitions are unlikely to finish processing.

Table might contain too many partitions (>50K). May crash the Hive Metastore.

To address this issue, re-evaluate the partitioning key strategy, as queries that access multiple partitions are unlikely to finish processing.

Table might contain too many partitions (>100K).

May crash the Hive Metastore.

To address this issue, re-evaluate the partitioning key strategy, as queries that access multiple partitions are unlikely to finish processing.

Unsupported commands: UPDATE or DELETE.

The UPDATE and DELETE commands are not supported on CDH.

To address this issue, rewrite the query with views or partitioning strategies that mimic UPDATE and DELETE. For example, using the partition-based INSERT with OVERWRITE or SELECT with views over main and delta tables are common workaround strategies.