Table and column statistics
Table and column statistics help Impala generate optimal query plans using table sizes and the degree of cardinality in columns.
The statistics information is stored in the Hive metastore database. You
COMPUTE STATS statement to collect and set
table-level and partition-level row counts as well as column statistics for
a particular table. However, running this statement is very CPU-intensive.
Based on the number of rows, number of data files, the total size of the
data files, and the file format that are involved, Cloudera recommends the
following strategies for computing statistics:
Missing stats can be identified by using the following
SHOW TABLE STATS [database_name.]table_name SHOW COLUMN STATS [database_name.]table_name
For more information about running
SHOWcommands, see SHOW Statement.
Table and column statistics are also recorded in the query profile as shown below:
- Do not compute statistics for tables that are not used in joins or for Impala queries. For example, statistics on a staging table that feeds into a data warehouse layer table one-to-one with no joins does not require statistics.
COMPUTE STATSonly to columns that are involved in filters, join conditions, or
GROUP BYclauses. For example, running
COMPUTE STATSin the following situation makes sense:
COMPUTE STATS wide_table [ join_column_a, join_column_b ]
COMPUTE STATSonly when there is over 30% data change in the data.
- Cloudera recommends that you run
COMPUTE STATSin off-peak hours, on weekends, or at night.
- When you reload a complete table or partition, where the number of rows
and distinct values for each column is relatively unchanged from before,
there is no need to rerun
Also see Table and Column Statistics for more information about table and column statistics.