
Data Warehouse Health Checks

Lists the data warehouse health check tests that are performed by Workload XM at the end of an Apache Impala job. They provide tips on how to make your workloads run faster and point out which aspects of your queries might be causing bottlenecks on your cluster. However, their recommendations are not exhaustive and there may be additional fixes other than those described. It is important to note that query tuning is as much an art as a science.

If you are currently satisfied with your cluster performance, you can use these health checks as a way to gain insights into how your query workloads are running on your cluster. That said, the suboptimal conditions identified by these health checks might cause problems as new applications are added, the system footprint is expanded, or the overall load on the system increases. Use these health checks to proactively monitor potential issues across your cluster.

You can find the data warehouse health checks on the Data Warehouse Queries page in the Health Check list and on the Data Warehouse Summary page in the Suboptimal Queries graph.

Table 1. Data Warehouse Health Checks
Aggregation Spilled Partitions

Indicates that data spilled to disk during the aggregation operation for these queries.

This health check is triggered during aggregation if there is not enough memory, which causes data to spill to disk. If you are satisfied with your cluster performance despite this health check being triggered, you can disregard it.

If performance is an issue, try the following fixes:

  • Use a less complex GROUP BY clause that involves fewer columns (do not use a high cardinality GROUP BY clause).
  • Increase the setting for the query's MEM_LIMIT query option. See the Impala documentation.
  • Add more physical memory.

For more details, see the Impala documentation SQL Operations that Spill to Disk.
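
For example, a minimal sketch of raising the per-query memory limit for a session before rerunning the aggregation; the table, columns, and the 2gb value are illustrative and should be sized to your workload and node memory:

  -- Illustrative only: raise the session memory limit, then rerun the aggregation.
  SET MEM_LIMIT=2gb;
  SELECT region, COUNT(*) AS order_count
  FROM sales_fact
  GROUP BY region;   -- fewer, lower-cardinality GROUP BY columns also reduce spilling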

Bytes Read Skew

Indicates that one of the cluster nodes is reading a significantly larger amount of data than other nodes.

To address this condition, rebalance the data or use the Impala SCHEDULE_RANDOM_REPLICA query option. For additional suggestions, see Avoiding CPU Hotspots for HDFS Cached Data in the Impala documentation set.
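
For example, a minimal sketch of enabling this query option for a session; the query itself is illustrative:

  -- Illustrative only: let Impala pick among equivalent replicas at random to avoid hotspots.
  SET SCHEDULE_RANDOM_REPLICA=true;
  SELECT COUNT(*) FROM sales_fact;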

Corrupt Table Statistics

Indicates that the table statistics available to these queries were incorrectly computed and cannot be used.

This condition can be caused by metastore database issues.

Recompute table statistics. For more information, see Detecting Missing Statistics in the Impala documentation set.
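
For example, a minimal sketch of dropping the corrupt statistics and recomputing them; the table name sales_fact is illustrative:

  -- Illustrative only: discard the corrupt statistics, then recompute them.
  DROP STATS sales_fact;
  COMPUTE STATS sales_fact;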

HashJoin Spilled Partitions

Indicates that data spilled to disk during the hash join operation for these queries.

This condition occurs when there is not enough memory during the hash join, which causes data to spill to disk.

To address this issue:

  • Reduce the cardinality of the right-hand side of the join by filtering more rows from it (see the sketch after this list).
  • Add more physical memory.
  • Increase the setting for the query's MEM_LIMIT query option. See the Impala documentation.
  • Use a denormalized table.
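
A minimal sketch combining the first and third fixes above; the table names, the filter, and the 4gb value are illustrative:

  -- Illustrative only: filter the right-hand (build) side before the join and raise the memory limit.
  SET MEM_LIMIT=4gb;
  SELECT f.order_id, d.customer_name
  FROM sales_fact f
  JOIN (SELECT customer_id, customer_name
        FROM customer_dim
        WHERE active = true) d
    ON f.customer_id = d.customer_id;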

Insufficient Partitioning

Indicates that there is insufficient partitioning for parallel query execution to occur for these queries.

This condition is triggered when query execution is wasting resources and time because the system is reading rows that are not required for the operation.

To address this condition:

  • Check to see if your more popular filters can become partition keys. For example, if you have many queries that use ship date as a filter, consider creating partitions using ship date as the partition key.
  • Add filters to your query for existing partition columns.

For more information, see Partitioning for Impala Tables in the Impala documentation set.
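
For example, a minimal sketch of the ship date approach described above; the table and column names are illustrative:

  -- Illustrative only: partition on the commonly filtered column ...
  CREATE TABLE orders_by_ship_date (
    order_id    BIGINT,
    customer_id BIGINT,
    amount      DECIMAL(12,2)
  )
  PARTITIONED BY (ship_date STRING)
  STORED AS PARQUET;

  -- ... so that filtering on the partition key prunes the partitions that are read.
  SELECT COUNT(*) FROM orders_by_ship_date WHERE ship_date = '2021-06-01';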

Many Materialized Columns

Indicates that an abnormally large number of columns were returned for these queries.

This condition is triggered only for Parquet tables, when a query reads more than 15 columns.

To address this condition, rewrite the query so it does not return more than 15 columns.
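
For example, a minimal sketch of the rewrite; the table and column names are illustrative:

  -- Illustrative only: instead of materializing every column of a wide Parquet table ...
  SELECT * FROM wide_events;
  -- ... select only the columns the application actually needs.
  SELECT event_id, event_time, user_id FROM wide_events;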

Missing Table Statistics

Indicates that no table statistics were computed for query optimization for these queries.

To address this condition, compute table statistics. For more information, see Detecting Missing Statistics in the Impala documentation set.
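
For example, a minimal sketch of checking for and then computing statistics; the table name is illustrative, and a #Rows value of -1 in the SHOW TABLE STATS output indicates missing statistics:

  -- Illustrative only: check whether statistics exist, then compute them.
  SHOW TABLE STATS sales_fact;
  COMPUTE STATS sales_fact;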

Slow Aggregate

Indicates that the aggregation operations were slower than expected for these queries.

Ten million rows per second is the typical throughput; if the observed throughput is less than that, this health check is triggered. Observed throughput is calculated by dividing the number of input rows by the time spent in the aggregation operation.

Addressing this condition depends on the root cause:

  • If the root cause is resource conflicts with other queries, then allocate different resource pools to reduce conflicts (see the sketch after this list).
  • If the root cause is overly complex GROUP BY operations, then rewrite the queries to simplify the GROUP BY operations.
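
A minimal sketch of the resource pool suggestion above, assuming admission control pools are configured; the pool name and query are illustrative:

  -- Illustrative only: run the aggregation in a separate admission control pool to reduce contention.
  SET REQUEST_POOL=reporting_pool;
  SELECT region, SUM(amount) FROM sales_fact GROUP BY region;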

Slow Client

Indicates that the client consumed query results slower than expected for these queries.

The causes and fixes for this health check can vary:

  • If the condition is triggered because some clients are taking too long to unregister the query, then use more appropriate clients for the workload. For example, if you are testing and building SQL queries, it might make more sense to use an interactive client over ODBC or JDBC.
  • If the condition is triggered because you are doing exploratory analysis, reading some rows, and then waiting for some time before reading the next set of rows, this uses up system resources because the query has not closed. To remediate, consider using the Impala timeout feature. See Setting Timeout Periods for Daemons, Queries, and Sessions in the Impala documentation set. As an additional option, consider adding a LIMIT clause to your queries to limit the number of rows returned to 100 or fewer (see the sketch after this list).
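
A minimal sketch of these options; the timeout value and the query are illustrative:

  -- Illustrative only: cancel the query if the client leaves it idle, and cap the rows returned.
  SET QUERY_TIMEOUT_S=600;
  SELECT order_id, amount
  FROM sales_fact
  ORDER BY amount DESC
  LIMIT 100;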

Slow Code Generation

Indicates that compiled code was generated more slowly than expected for these queries.

In every query plan fragment, Impala considers how much time is used to generate the code; this health check indicates that the time exceeded 20% of the overall query execution time. This might be triggered by query complexity, for example, when the query has too many predicates in its WHERE clauses, too many joins, or too many columns.

For queries where code generation is too slow, consider using the DISABLE_CODEGEN query option in your session.
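
For example, a minimal sketch of disabling code generation for a session; the query is illustrative:

  -- Illustrative only: skip code generation when it costs more than it saves for short queries.
  SET DISABLE_CODEGEN=true;
  SELECT order_id, amount FROM sales_fact WHERE order_id = 42;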

Slow HDFS Scan

Indicates that scanning data from HDFS was slower than expected for these queries.

This condition is caused by a slow disk, extremely complex scan predicates, or an HDFS NameNode that is too busy. The HDFS scan rate is based on the amount of time that the scanner took to read a specific number of rows.

To address this condition:

  • Replace the disk if the cause is a slow disk.
  • Reduce complexity by simplifying the scan predicates.
  • If the HDFS NameNode is too busy, consider upgrading to CDH 5.15 or later. For more information, see Upgrading Cloudera Manager and CDH.

Slow Hash Join

Indicates that hash join operations were slower than expected for these queries.

This health check might be triggered when there are overly complex join predicates or when the hash join causes data to spill to disk. Five million rows per second is the typical throughput; if the observed throughput is less than that, this health check is triggered. Observed throughput is calculated by dividing the number of input rows by the time spent in the hash join operation.

To fix this condition, simplify the join predicates or reduce the size of the right side of the join.

Slow Query Planning

Indicates that the query plan was generated more slowly than expected for these queries.

This health check is triggered when the query planning time exceeds 30% of the overall query execution time. This can be caused by very complex queries or by a metadata refresh that occurs while the query is executing.

To fix this condition, consider simplifying your queries. For example, reduce the number of columns returned, reduce the number of filters, or reduce the number of joins.

Slow Row Materialization

Indicates that rows were returned more slowly than expected for these queries.

This health check is triggered if it takes more than 20% of the query execution time to return rows. It can be caused by overly complex expressions in the SELECT list or by requesting too many rows.

To address this condition, simplify the query by reducing the number of columns in the SELECT list or by reducing the number of rows requested.

Slow Sorting Speed

Indicates that the sorting operations were slower than expected for these queries.

Ten million rows per second is the typical throughput; if the observed throughput is less than that, this health check is triggered. Observed throughput is calculated by dividing the number of input rows by the time spent in the sorting operation.

To fix this condition, simplify the ORDER BY clause in queries. If data is spilling to disk, reduce the volume of data to be sorted by adding more predicates to the WHERE clauses, by increasing the available memory, or by increasing the value specified for the MEM_LIMIT query option. See the Impala documentation.
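
For example, a minimal sketch of these suggestions; the extra predicate, table name, and the 4gb value are illustrative:

  -- Illustrative only: filter earlier to shrink the data set being sorted and raise the memory limit.
  SET MEM_LIMIT=4gb;
  SELECT order_id, amount
  FROM sales_fact
  WHERE ship_date >= '2021-01-01'
  ORDER BY amount DESC;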

Slow Write Speed

Indicates that the query write speed was slower than expected for these queries.

If the difference between the actual write time and the expected write time is more than 20% of the query execution time, this health check is triggered. This condition can be caused when overly complex expressions are used in the SELECT list, too many columns are specified, or too many rows are requested.

To address this condition, simplify the query by reducing the number of columns or by reducing the complexity of the expressions in the SELECT list.
