Data Warehouse Health Checks
Lists the data warehouse health check tests that Workload XM performs at the end of an Apache Impala job. These health checks provide tips on how to make your workloads run faster and point out which aspects of your queries might be causing bottlenecks on your cluster. However, their recommendations are not exhaustive, and there may be additional fixes beyond those described. Keep in mind that query tuning is as much an art as a science.

If you are currently satisfied with your cluster performance, you can use these health checks to gain insight into how your query workloads are running on your cluster. That said, the suboptimal conditions they identify might cause problems as new applications are added, the system footprint expands, or the overall load on the system increases, so use these health checks to proactively monitor potential issues across your cluster. You can find the data warehouse health checks on the Data Warehouse Queries page in the Health Check list, and on the Data Warehouse Summary page in the Suboptimal Queries graph.
Aggregation Spilled Partitions
Indicates that data spilled to disk during the aggregation operation for these queries. This health check is triggered when there is not enough memory during aggregation, which causes data to spill to disk. If you are satisfied with your cluster performance despite this health check being triggered, you can disregard it. If performance is an issue, see the Impala documentation topic SQL Operations that Spill to Disk for possible fixes.
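When spilling is the bottleneck, one common mitigation is to raise the per-query memory limit so the aggregation can complete in memory. A minimal sketch in Impala SQL, where the table name and the 4gb value are illustrative assumptions:

```sql
-- Raise the per-query memory limit for this session so the
-- aggregation is less likely to spill to disk (value is illustrative).
SET MEM_LIMIT=4gb;

-- "sales" is a hypothetical table.
SELECT region, count(*) AS order_count
FROM sales
GROUP BY region;
```

Whether this helps depends on how much memory the hosts actually have available; the SQL Operations that Spill to Disk topic covers the trade-offs.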
Bytes Read Skew
Indicates that one of the cluster nodes is reading a significantly larger amount of data than the other nodes. To address this condition, rebalance the data across the nodes, or consider an Impala query option such as SCHEDULE_RANDOM_REPLICA, which spreads scans across replicas instead of repeatedly choosing the same host.
Corrupt Table Statistics
Indicates that these queries rely on table statistics that were incorrectly computed and cannot be used. This condition can be caused by metastore database issues. To address it, recompute the table statistics. For more information, see Detecting Missing Statistics in the Impala documentation set.
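One way to clear bad statistics and recompute them, sketched with a hypothetical table name:

```sql
-- Discard the corrupt statistics, then recompute them.
-- "sales" is a hypothetical table name.
DROP STATS sales;
COMPUTE STATS sales;

-- Spot-check the result; the #Rows column should now show
-- plausible values rather than -1 or wildly wrong counts.
SHOW TABLE STATS sales;
```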
HashJoin Spilled Partitions
Indicates that data spilled to disk during the hash join operation for these queries. This condition occurs when there is not enough memory during the hash join, which causes data to spill to disk. To address this issue, increase the memory available to the query, or reduce the size of the join input that is built into memory, typically the right side of the join.
Insufficient Partitioning
Indicates that there is insufficient partitioning for parallel query execution to occur for these queries. This condition is triggered when query execution wastes resources and time because the system reads rows that are not required for the operation. To address this condition, partition the table so that queries scan only the partitions they need. For more information, see Partitioning for Impala Tables in the Impala documentation set.
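A minimal sketch of partitioning in Impala SQL, using hypothetical table and column names; queries that filter on the partition key then scan only the matching partitions:

```sql
-- Partition a log table by date so date-filtered queries
-- can skip irrelevant partitions entirely.
CREATE TABLE logs_partitioned (
  event_id BIGINT,
  message  STRING
)
PARTITIONED BY (event_date STRING)
STORED AS PARQUET;

-- Only the partition for 2024-01-15 is scanned:
SELECT count(*)
FROM logs_partitioned
WHERE event_date = '2024-01-15';
```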
Many Materialized Columns
Indicates that an abnormally large number of columns were returned for these queries. This condition is triggered only for Parquet tables, when a query reads more than 15 columns. To address it, rewrite the query so that it does not return more than 15 columns.
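For example, instead of selecting every column of a wide Parquet table, list only the columns the application needs (the table and column names here are hypothetical):

```sql
-- Materializes every column of a wide table and can trigger
-- the Many Materialized Columns check:
SELECT * FROM web_events;

-- Reads only three column chunks from the Parquet files:
SELECT event_time, user_id, url
FROM web_events;
```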
Missing Table Statistics
Indicates that no table statistics were computed for query optimization for these queries. To address this condition, compute the table statistics. For more information, see Detecting Missing Statistics in the Impala documentation set.
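A quick way to detect and fix missing statistics, sketched with a hypothetical table name:

```sql
-- A #Rows value of -1 indicates statistics were never computed.
SHOW TABLE STATS orders;

-- Compute table and column statistics for the optimizer:
COMPUTE STATS orders;
```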
Slow Aggregate
Indicates that the aggregation operations were slower than expected for these queries. Ten million rows per second is the typical throughput; if the observed throughput is less than that, this health check is triggered. Observed throughput is calculated by dividing the number of input rows by the time spent in the aggregation operation. How to address this condition depends on the root cause, such as insufficient memory that forces the aggregation to spill to disk, or overly complex aggregate expressions.
Slow Client
Indicates that the client consumed query results more slowly than expected for these queries. Because the bottleneck is on the client side rather than in the cluster, the causes and fixes vary; start by examining how the client application fetches and processes the result rows.
Slow Code Generation
Indicates that compiled code was generated more slowly than expected for these queries. For every query plan fragment, Impala tracks how much time is spent generating code, and this health check indicates that the time exceeded 20% of the overall query execution time. It can be triggered by query complexity, for example, when a query has a large number of predicates in its WHERE clause. For queries where code generation is too slow, consider disabling it with the DISABLE_CODEGEN query option.
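A sketch of turning code generation off for a query where the compilation overhead outweighs its benefit (the table name is hypothetical):

```sql
-- Skip code generation for this session; useful for short queries
-- where codegen time dominates execution time.
SET DISABLE_CODEGEN=true;

SELECT count(*) FROM small_lookup;

-- Re-enable code generation afterward.
SET DISABLE_CODEGEN=false;
```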
Slow HDFS Scan
Indicates that scanning data from HDFS was slower than expected for these queries. This condition is caused by a slow disk, extremely complex scan predicates, or an HDFS NameNode that is too busy. The HDFS scan rate is based on the amount of time that the scanner takes to read a specific number of rows. To address this condition, check the health of the disks and the NameNode, and simplify the scan predicates where possible.
Slow Hash Join
Indicates that hash join operations were slower than expected for these queries. Five million rows per second is the typical throughput; if the observed throughput is less than that, this health check is triggered. Observed throughput is calculated by dividing the number of input rows by the time spent in the hash join operation. Overly complex join predicates, or a hash join that spills data to disk, can trigger this check. To fix this condition, simplify the join predicates or reduce the size of the right side of the join.
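With up-to-date statistics, Impala usually places the smaller input on the build side itself; to control the order manually, write the smaller table on the right and use the STRAIGHT_JOIN keyword to keep the joins in the order written. The table names here are hypothetical:

```sql
-- Impala builds the in-memory hash table from the right-hand input,
-- so keep the smaller table on the right. STRAIGHT_JOIN tells the
-- planner not to reorder the joins.
SELECT STRAIGHT_JOIN f.order_id, d.region_name
FROM fact_orders f          -- large table
JOIN dim_regions d          -- small table
  ON f.region_id = d.region_id;
```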
Slow Query Planning
Indicates that the query plan was generated more slowly than expected for these queries. This health check is triggered when the query planning time exceeds 30% of the overall query execution time. It can be caused by very complex queries or by a metadata refresh that occurs while the query is executing. To fix this condition, consider simplifying your queries: for example, reduce the number of columns returned, the number of filters, or the number of joins.
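Before simplifying, it can help to look at the plan itself; EXPLAIN returns the plan without running the query. The query below is a hypothetical example:

```sql
-- Show the query plan; an unusually deep or wide plan suggests
-- reducing joins, filters, or returned columns.
EXPLAIN
SELECT c.name, sum(o.total) AS revenue
FROM customers c
JOIN orders o ON o.customer_id = c.id
GROUP BY c.name;
```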
Slow Row Materialization
Indicates that rows were returned more slowly than expected for these queries. This health check is triggered if returning rows takes more than 20% of the query execution time. It can be caused by overly complex expressions in the select list. To address this condition, simplify the query by reducing the number of columns in the select list or by reducing the number of rows requested.
Slow Sorting Speed
Indicates that the sorting operations were slower than expected for these queries. Ten million rows per second is the typical throughput; if the observed throughput is less than that, this health check is triggered. Observed throughput is calculated by dividing the number of input rows by the time spent in the sorting operation. To fix this condition, simplify the expressions in the ORDER BY clause or reduce the amount of data that is sorted.
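When only the first rows of a sorted result are needed, pairing ORDER BY with LIMIT lets Impala run a cheaper top-N sort instead of sorting the full input (the table and column names are hypothetical):

```sql
-- Top-N sort: only the 100 highest scores are retained,
-- rather than sorting the entire table.
SELECT user_id, score
FROM leaderboard
ORDER BY score DESC
LIMIT 100;
```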
Slow Write Speed
Indicates that the query wrote its results more slowly than expected for these queries. If the difference between the actual write time and the expected write time is more than 20% of the query execution time, this health check is triggered. This condition can be caused when overly complex expressions are used, too many columns are specified, or too many rows are requested in the query. To address this condition, simplify the query by reducing the number of columns or by reducing the complexity of its expressions.