Automatic Row Count Estimation

Impala can optimize complex or multi-table queries more effectively when it has access to statistics about the volume of data and how the values are distributed. Impala uses this information to help parallelize and distribute the work for a query.

New Default Behavior

The Impala query planner can make use of statistics about entire tables and partitions. This information includes physical characteristics such as the number of rows, number of data files, the total size of the data files, and the file format. For partitioned tables, the numbers are calculated per partition, and as totals for the whole table. This metadata is stored in the Metastore database, and can be updated by either Impala or Hive.
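As an illustration, the table and partition statistics described above can be inspected and gathered from impala-shell. This is a minimal sketch; the table name sales_fact is a hypothetical placeholder, not a table from this documentation.

```sql
-- sales_fact is a hypothetical table name used only for illustration.

-- Show the number of rows, number of files, total size, and file format
-- recorded for the table. For a partitioned table, the output has one
-- line per partition plus a total. A #Rows value of -1 means that no
-- row count statistic is available.
SHOW TABLE STATS sales_fact;

-- Gather table and column statistics and store them in the Metastore
-- database, so that the planner does not need to estimate the row count.
COMPUTE STATS sales_fact;
```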

If there are no statistics available on a table, Impala estimates the cardinality (the number of rows in the table) from the size of the table's data files. This behavior is switched on by default and should result in better plans in most cases where statistics are not available.

In some edge cases, when statistics are not present on a table, Impala might generate a worse plan than it would for the same query in CDH, which can negatively affect query performance.

Steps to switch to the CDH behavior:

Set the DISABLE_HDFS_NUM_ROWS_ESTIMATE query option to TRUE to disable this optimization.
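For example, the option can be set for the current session in impala-shell, and the effect checked by comparing EXPLAIN output before and after. This is a sketch only; the tables sales_fact and customer_dim and their columns are hypothetical.

```sql
-- Disable the automatic row count estimation for the current session.
SET DISABLE_HDFS_NUM_ROWS_ESTIMATE=TRUE;

-- Inspect the plan for a query on tables without statistics.
-- sales_fact, customer_dim, and their columns are hypothetical names.
EXPLAIN
SELECT c.region, COUNT(*)
FROM sales_fact s
JOIN customer_dim c ON s.customer_id = c.customer_id
GROUP BY c.region;

-- Restore the default behavior when finished.
SET DISABLE_HDFS_NUM_ROWS_ESTIMATE=FALSE;
```

A query option set this way applies only to the current session. Where practical, running COMPUTE STATS on the affected tables removes the need for estimation altogether.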