Impala health checks

Lists the Impala health check tests that are performed by Cloudera Observability at the end of an Apache Impala job. They provide performance and query insights, such as pointing out queries that may be causing bottlenecks. You can find the Impala health checks on the Impala Queries page in the Health Check list.

Execution completion health checks

The execution metric determines whether a job failed or passed the Cloudera Observability health check.

Table 1. Execution
Health Check Description
Failed - Any Healthcheck

Displays jobs that failed at least one health check.

Metadata/Statistics health checks

The metadata/statistics metrics check the table statistics that describe the distribution of values in one or more columns of the data table, which are used for query optimization.

Table 2. Metadata/Statistics
Health Check Description Recommendation
Corrupt Table Statistics
Indicates that the tables used by these queries have statistics that were incorrectly computed and therefore cannot be used.

To address this condition, consider recomputing the table statistics.

For more information, see the Impala documentation.

Missing Table Statistics

Indicates that no table statistics were computed for query optimization.

To address this condition, consider computing the table statistics.

For more information, see the Impala documentation.
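
For example, a minimal sketch of computing statistics with impala-shell, using a hypothetical table named sales; for corrupt statistics, dropping and recomputing them is one option:

    -- Compute table and column statistics for the hypothetical sales table.
    COMPUTE STATS sales;

    -- If existing statistics are corrupt, drop them first and recompute.
    DROP STATS sales;
    COMPUTE STATS sales;

    -- Verify that row counts and column statistics are now populated.
    SHOW TABLE STATS sales;
    SHOW COLUMN STATS sales;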

Optimal configuration health checks

The optimal configuration metrics determine whether the performance of the query's operations was impacted by insufficient resources.

Table 3. Optimal Configuration
Health Check Description Recommendation
Aggregation Spilled Partitions

Indicates that during the query's aggregation operation, data was spilled to disk.

This health check is triggered when there is not enough memory to complete the operation.

To address this condition, consider:

  • Replacing high-cardinality GROUP BY clauses, which can lead to memory issues, with lower-cardinality clauses that group your data by fewer columns.
  • Increasing the query's memory limit setting with the MEM_LIMIT query option (see the sketch after this list).
  • Adding more physical memory.

For more information, see the Impala documentation.
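
For example, a minimal sketch of the MEM_LIMIT suggestion above; the table, columns, and the 2gb value are illustrative only, not recommendations:

    -- Raise the per-host memory limit for this session (example value only).
    SET MEM_LIMIT=2gb;

    -- Grouping by a lower-cardinality column (for example, store_id rather than
    -- a near-unique customer_id) also reduces the memory needed by the aggregation.
    SELECT store_id, SUM(amount) AS total_amount
    FROM sales
    GROUP BY store_id;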

HashJoin Spilled Partitions

Indicates that during the query's hash join operation, data was spilled to disk.

This health check is triggered when there is not enough memory to complete the operation.

To address this condition, consider:

  • Reducing the cardinality of the right-hand side of the join by filtering out more rows (see the sketch after this list).
  • Increasing the query's memory limit setting with the MEM_LIMIT query option.
  • Using a denormalized table.
  • Adding more physical memory.
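
For example, a minimal sketch of reducing the build (right-hand) side of the join before it is hashed; the table and column names are illustrative only:

    -- Filter the right-hand side so fewer rows are hashed into memory.
    SELECT o.order_id, c.customer_name
    FROM orders o
    JOIN (
      SELECT customer_id, customer_name
      FROM customers
      WHERE active = TRUE      -- trims the build-side cardinality
    ) c ON o.customer_id = c.customer_id;

    -- Raising the session memory limit is another option (example value only).
    SET MEM_LIMIT=4gb;
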
Slow Client

Indicates that the client consumed the query results slower than expected.

How to address this condition depends on the root cause. For example:

  • If the condition is triggered because some clients are taking too long to unregister the query, consider using more appropriate clients for the workload, such as an interactive client rather than an ODBC or JDBC client when testing and building SQL queries.
  • If the condition is triggered because the client is unable to close the query in a timely manner, consider using the Impala timeout feature, for example when your Impala job contains wait times between reading each set of rows during exploratory analysis. Leaving such queries open also depletes system resources.

    Additionally, consider limiting the number of returned rows to 100 or fewer by adding a LIMIT clause to your queries, as in the sketch below.

For more information about setting timeout periods for daemons, queries, and sessions, see the Impala documentation.
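
For example, a minimal sketch of both suggestions; the table name and the timeout value are illustrative, and QUERY_TIMEOUT_S sets the idle query timeout for the session:

    -- Cancel queries that stay idle for more than 600 seconds.
    SET QUERY_TIMEOUT_S=600;

    -- Limit the rows returned during exploratory analysis.
    SELECT *
    FROM web_logs
    ORDER BY event_time DESC
    LIMIT 100;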

Performance health checks

The performance metrics measure the query's execution times.

Table 4. Performance
Health Check Description Recommendation
Slow Aggregate

Indicates that the aggregation operations were slower than expected.

This health check is triggered when the observed throughput is less than ten million rows per second.

How to address this condition depends on the root cause. For example:

  • If the root cause is resource contention with other queries, consider reducing the contention by allocating the queries to different resource pools (see the example after this list).
  • If the root cause is from overly complex GROUP BY operations, consider rewriting the queries with simpler GROUP BY operations.
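
For example, a minimal sketch of both suggestions; the pool name, table, and columns are illustrative, and the resource pool must already be defined in admission control:

    -- Route this session's queries to a separate admission-control pool.
    SET REQUEST_POOL=reporting;

    -- Prefer simpler GROUP BY operations on fewer, lower-cardinality columns.
    SELECT region, COUNT(*) AS order_count
    FROM sales
    GROUP BY region;
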
Slow Code Generation

Indicates that the compiled code was generated slower than expected.

This health check is triggered when the generation time exceeds 20% of the overall query execution time.

This condition may be triggered due to an overly complex query. For example, if the query has too many predicates in its WHERE clause, contains too many joins, or contains too many columns.

To address this condition, consider using the DISABLE_CODEGEN query option in your session.
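
For example, a minimal sketch of turning code generation off for the current session and re-enabling it afterwards:

    -- Skip code generation for an overly complex or short-running query.
    SET DISABLE_CODEGEN=true;

    -- ... run the affected query ...

    -- Re-enable code generation so other queries still benefit from it.
    SET DISABLE_CODEGEN=false;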

Slow HDFS Scan

Indicates that the time taken to scan data from HDFS was slower than expected.

This condition is caused by either a slow disk, extremely complex scan predicates, or a busy HDFS NameNode.

To address this condition, depending on the cause, consider the following:

  • If the cause is a slow disk, replace the disk.
  • If the cause is complex scan predicates, reduce the complexity by simplifying the scan predicates.
  • If the cause is due to a busy HDFS NameNode, consider upgrading.
Slow Hash Join

Indicates that the hash join operations were slower than expected.

This health check is triggered when the observed throughput is less than five million rows per second.

This condition may be triggered when there are overly complex join predicates or a hash join is causing data to spill to disk.

To address this condition, consider simplifying the join predicates or reducing the size of the right-hand side of the join, as in the sketch below.
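
For example, a minimal sketch of keeping the join predicate a simple equality on a single key while trimming the right-hand side; the names are illustrative only:

    -- Prefer a plain equality join key over predicates that apply expressions
    -- to the join columns (for example, UPPER(o.cust_code) = UPPER(c.cust_code)).
    SELECT o.order_id, c.customer_name
    FROM orders o
    JOIN customers c
      ON o.customer_id = c.customer_id
    WHERE c.active = TRUE;   -- also reduces the right-hand side of the join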

Slow Query Planning

Indicates that the query plan was generated more slowly than expected.

This health check is triggered when the query planning time exceeds 30% of the overall query execution time.

This condition may be caused by an overly complex query or by a metadata refresh that occurred while the query was executing.

To address this condition, consider simplifying your queries. For example, reduce the number of columns returned, reduce the number of filters, or reduce the number of joins.

Slow Row Materialization

Indicates that rows were returned slower than expected.

This health check is triggered when it takes more than 20% of the query execution time to return rows.

This condition may be caused when overly complex expressions are used in the SELECT list or when too many rows are requested.

To address this condition, simplify the query by either reducing the number of columns in the SELECT list or reducing the number of requested rows.

Slow Sorting

Indicates that the sorting operations were slower than expected.

This health check is triggered when the observed throughput is less than ten million rows per second.

To address this condition, consider the following:

  • Simplify the ORDER BY clause in your queries.
  • If data is spilling to disk, reduce the amount of data to be sorted by adding more predicates to the WHERE clause, or increase the available memory or the value of the MEM_LIMIT query option (see the sketch after this list).
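
For example, a minimal sketch of the suggestions above; the table, columns, date, and memory value are illustrative only:

    -- Example memory limit for the session, not a recommendation.
    SET MEM_LIMIT=4gb;

    -- Extra WHERE predicates reduce the rows that reach the sort, and a single
    -- simple sort key keeps the ORDER BY clause inexpensive.
    SELECT order_id, order_date, amount
    FROM orders
    WHERE order_date >= '2023-01-01'
    ORDER BY order_date;
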
Slow Write Speed

Indicates that the query's write speed is slower than expected.

This health check is triggered when the difference between the actual write time and the expected write time is more than 20% of the query execution time.

This condition may be caused when overly complex expressions are used, too many columns are specified, or too many rows are requested from the SELECT list.

To address this condition, depending on the cause, consider the following:

  • If the cause is from overly complex expressions, reduce the complexity by simplifying the expressions.
  • If the cause is from too many specified columns, reduce the number of columns.
  • If the cause is from requesting too many rows in the SELECT list, reduce the number of requested rows.

Query/Schema design health checks

The query/schema design metrics determine whether the query contains inefficient code.

Table 5. Query/Schema Design
Health Check Description Recommendation
Insufficient Partitioning

Indicates that there is an insufficient number of partitions to enable parallel processing.

This health check is triggered when the system reads rows that are not required for the query's operation, which increases the query's run-time duration and depletes resources.

To address this condition, consider:

  • Adding filters to your query for existing partitioned columns.
  • Using your most frequently used filters as partition keys. For example, if you have multiple queries that use the ship date as a filter, consider creating partitions where the ship date is the partition key (see the sketch after this list).

For more information, see the Impala documentation.
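
For example, a minimal sketch of the ship-date suggestion above; the table, columns, and file format are illustrative only:

    -- Create the table with the ship date as the partition key.
    CREATE TABLE orders_by_ship_date (
      order_id BIGINT,
      customer_id BIGINT,
      amount DECIMAL(10,2)
    )
    PARTITIONED BY (ship_date STRING)
    STORED AS PARQUET;

    -- A filter on the partition key lets Impala read only the matching partitions.
    SELECT COUNT(*) AS january_orders
    FROM orders_by_ship_date
    WHERE ship_date BETWEEN '2023-01-01' AND '2023-01-31';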

Many Materialized Columns

Indicates that an unusually large number of columns were returned for the query.

This health check is triggered when the query reads more than 15 columns.

To address this condition, consider rewriting the query to return 15 columns or fewer.
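
For example, a minimal sketch of selecting only the columns that are needed instead of materializing every column; the table and column names are illustrative only:

    -- Avoid SELECT * on wide tables; list only the required columns.
    SELECT event_id, event_time, user_id, status
    FROM wide_events;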

Skew health checks

The skew metrics compare the performance of the query's operations to other operations within the same job. For optimal performance, operations within the same job should perform the same amount of processing.

Table 6. Skew
Health Check Description Recommendation
Bytes Read Skew

Indicates that one of the cluster nodes is reading a significantly larger amount of data than the other nodes in the cluster.

To address this condition, consider rebalancing the data or using the Impala SCHEDULE_RANDOM_REPLICA query option.

For more information, see the Impala documentation.
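
For example, a minimal sketch of enabling the SCHEDULE_RANDOM_REPLICA query option for the current session:

    -- Choose a random host among the replicas that hold the data locally,
    -- instead of always scheduling the scan on the same host.
    SET SCHEDULE_RANDOM_REPLICA=true;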

Duration Skew

Indicates that one or more cluster nodes are taking longer to execute the query than others.

The skew indicates an uneven distribution of data across the cluster nodes. The more evenly the data is distributed, the faster the operations run on the cluster. Operations that use JOIN and GROUP BY clauses may require rewriting the query or changing the underlying data partitioning to use columns with the most evenly distributed values.

To address this condition, as a starting point, consider configuring the query so that its processing is distributed evenly across operations.