Table and column statistics
Table and column statistics help Impala generate optimal query plans using table sizes and the degree of cardinality in columns.
By: Manish Maheshwari, Data Architect and Data Scientist at Cloudera, Inc.
The statistics information is stored in the Hive metastore database. You
run the COMPUTE STATS
statement to collect and set
table-level and partition-level row counts as well as column statistics for
a particular table. However, running this statement is very CPU-intensive.
Based on the number of rows, number of data files, the total size of the
data files, and the file format that are involved, Cloudera recommends the
following strategies for computing statistics:
-
Missing stats can be identified by using the following
SHOW
commands:SHOW TABLE STATS [database_name.]table_name SHOW COLUMN STATS [database_name.]table_name
For more information about running
SHOW
commands, see SHOW Statement. -
Table and column statistics are also recorded in the query profile as shown below:
- Do not compute statistics for tables that are not used in joins or for Impala queries. For example, statistics on a staging table that feeds into a data warehouse layer table one-to-one with no joins does not require statistics.
-
Limit running
COMPUTE STATS
only to columns that are involved in filters, join conditions, orGROUP BY
clauses. For example, runningCOMPUTE STATS
in the following situation makes sense:COMPUTE STATS wide_table [ join_column_a, join_column_b ]
- Rerun
COMPUTE STATS
only when there is over 30% data change in the data. - Cloudera recommends that you run
COMPUTE STATS
in off-peak hours, on weekends, or at night. - When you reload a complete table or partition, where the number of rows
and distinct values for each column is relatively unchanged from before,
there is no need to rerun
COMPUTE STATS
.
Also see Table and Column Statistics for more information about table and column statistics.