ReferencePDF version

Impala health checks

Lists the Impala health check tests that are performed by Cloudera Observability On-Premises at the end of an Apache Impala job. They provide performance and query insights, such as pointing out queries that may be causing bottlenecks. You can find the Impala health checks on the Impala Queries page in the Health Check list.

The execution metric determines whether a job failed or passed the Cloudera Observability On-Premises health check.

Table 1. Execution
Health Check Description
Failed - Any Healthcheck Displays jobs that failed at least one health check.

The metadata/statistic metrics test the distribution of values in one or more columns of the data table for query optimization.

Table 2. Metadata/Statistics
Health Check Description Recommendation
Corrupt Table Statistics
Indicates that these queries contain table statistics that were incorrectly computed and therefore cannot be used.

To address this condition, consider recomputing the table statistics.

For more information, see the Impala documentation.

Missing Table Statistics

Indicates that no table statistics were computed for query optimization.

To address this condition, consider computing the table statistics.

For more information, see the Impala documentation.

The optimal configuration metrics determine whether the query's operation performance was impacted by insufficient resources

Table 3. Optimal Configuration
Health Check Description Recommendation
Aggregation Spilled Partitions

Indicates that during the query's aggregation operation, data was spilled to disk.

This health check is triggered when there is not enough memory to complete the operation.

To address this condition, consider:

  • Replacing the high-cardinality GROUP-BY clauses, which can lead to memory issues, with low-cardinality clauses that organize your data with fewer columns.
  • Increasing the query's memory limit setting with the MEM_LIMIT query option.
  • Adding more physical memory.

For more information, see the Impala documentation.

HashJoin Spilled Partitions

Indicates that during the query's hash join operation, data was spilled to disk.

This health check is triggered when there is not enough memory to complete the operation.

To address this condition, consider:

  • Reducing the cardinality from the right-hand side of the join by filtering more rows.
  • Increasing the query's memory limit setting with the MEM_LIMIT query option.
  • Using a denormalized table.
  • Adding more physical memory.
Slow Client

Indicates that the client consumed the query results slower than expected.

To address this condition depends on the root cause. For example:

  • If the condition is triggered because some clients are taking too long to unregister the query, consider using more appropriate clients for the workload. Such as using an interactive client rather than an ODBC or a JDBC client when testing and building SQL queries.
  • If the condition is triggered because the client is unable to close the query in a timely manner, consider using the Impala Timeout feature. Such as when your Impala job contains wait times between reading each set of rows during exploratory analysis. This example, will also deplete system resources.

    Additionally, consider limiting the number of returned rows to 100 or less by adding a LIMIT clause to your queries.

For more information about setting timeout periods for daemons, queries, and sessions, see the Impala documentation.

The performance metrics measure the query's execution times.

Table 4. Performance
Health Check Description Recommendation
Slow Aggregate

Indicates that the aggregation operations were slower than expected.

This health check is triggered when the observed throughput is less than ten million rows per second.

To address this condition depends on the root cause. For example:

  • If the root cause is from resource conflicts with other queries, consider reducing conflicts by allocating different resource pools.
  • If the root cause is from overly complex GROUP BY operations, consider rewriting the queries with simpler GROUP BY operations.
Slow Code Generation

Indicates that the compiled code was generated slower than expected.

This health check is triggered when the generation time exceeds 20% of the overall query execution time.

This condition may be triggered due to an overly complex query. For example, if the query has too many predicates in its WHERE clause, contains too many joins, or contains too many columns.

To address this condition, consider using the DISABLE_CODEGEN query option in your session.

Slow HDFS Scan
Indicates that the time taken to scan data from HDFS was slower than expected.
This condition is caused by either a slow disk, extremely complex scan predicates, or a busy HDFS NameNode.

Depending on the cause, to address this condition consider the following:

  • If the cause is a slow disk, replace the disk.
  • If the cause is through complex scan predicates, reduce the complexity by simplifying the scan predicates.
  • If the cause is due to a busy HDFS NameNode, consider upgrading.
Slow Hash Join

Indicates that the hash join operations were slower than expected.

This health check is triggered when the observed throughput is less than five million rows per second.

This condition may be triggered when there are overly complex join predicates or a hash join is causing data to spill to disk.

To address this condition, consider simplifying the join predicates or reducing the size on the right-hand side of the join.

Slow Query Planning

Indicates that the query plan generated slower than expected.

This health check is triggered when the query planning time exceeds 30% of the overall query execution time.

This condition may be caused by overly complex queries or if a metadata refresh occurred whilst the query was executing.

To address this condition, consider simplifying your queries. For example, reduce the number of columns returned, reduce the number of filters, or reduce the number of joins.

Slow Row Materialization

Indicates that rows were returned slower than expected.

This health check is triggered when it takes more than 20% of the query execution time to return rows.

This condition may be caused when overly complex expressions are used in the SELECT list or when too many rows are requested.

To address this condition, simplify the query by either reducing the number of columns in the selected list or reducing the number of requested rows.

Slow Sorting

Indicates that the sorting operations were slower than expected.

This health check is triggered when the observed throughput is less than ten million rows per second.

To address this condition, consider the following:

  • Simplify the ORDER BY clause in your queries.
  • If data is spilling to disk, reduce the amount of data to be sorted by either adding more predicates to the WHERE clause, increasing the available memory, or increasing the value specified by the MEM_LIMIT query option.
Slow Write Speed

Indicates that the query's write speed is slower than expected.

This health check is triggered when the difference between the actual write time and the expected write time is more than 20% of the query execution time.

This condition may be caused when overly complex expressions are used, too many columns are specified, or too many rows are requested from the SELECT list.

Depending on the cause, to address this condition consider the following:

  • If the cause is from overly complex expressions, reduce the complexity by simplifying the expressions.
  • If the cause is from too many specified columns, reduce the number of columns.
  • If the cause is from requesting too many rows in the SELECT list, reduce the complexity of the SELECT list expression.

The query/schema design metrics determine whether the query contains inefficient code.

Table 5. Query/Schema Design
Health Check Description Recommendation
Insufficient Partitioning

Indicates that there is an insufficient number of partitions to enable parallel processing.

This health check is triggered when the system reads rows that are not required for the query's operation, which increases the query's run-time duration and depletes resources.

To address this condition, consider:

  • Adding filters to your query for existing partitioned columns.
  • Using your more popular filters as partition keys. For example, if you have multiple queries that use the ship date as a filter, consider creating partitions where the ship date is the partition key.

For more information, see the Impala documentation.

Many Materialized Columns

Indicates that an unusually large number of columns were returned for the query.

This health check is triggered when the query reads more than 15 columns.

To address this condition, consider rewriting the query to return 15 columns or less.

The skew metrics compare the performance of the query's operations to other operations within the same job. For optimal performance, operations within the same job should perform the same amount of processing.

Table 6. Skew
Health Check Description Recommendation
Bytes Read Skew

Indicates that one of the cluster nodes is reading a significantly larger amount of data than the other nodes in the cluster.

To address this condition, consider rebalancing the data or using the Impala SCHEDULE_RANDOM_REPLICA query option.

For more information, see the Impala documentation.

Duration Skew Indicates that one or more cluster nodes are taking longer to execute the query than others.

The skew indicates an uneven distribution of data across cluster nodes. The more evenly the data is distributed, the faster the operations will run on the cluster. Operations that use JOINS and GROUP BY clauses may require rewriting the query or changing the underlying data partitioning to use columns with the most evenly distributed values.

To address this condition, as a starting point, consider configuring the query so that its processing is distributed evenly across operations.