Table and column statistics

Table and column statistics help Impala generate optimal query plans using table sizes and the degree of cardinality in columns.

By: Manish Maheshwari, Data Architect and Data Scientist at Cloudera, Inc.

The statistics information is stored in the Hive metastore database. You run the COMPUTE STATS statement to collect and set table-level and partition-level row counts as well as column statistics for a particular table. However, running this statement is very CPU-intensive. Based on the number of rows, number of data files, the total size of the data files, and the file format that are involved, Cloudera recommends the following strategies for computing statistics:

  • Missing stats can be identified by using the following SHOW commands:

    SHOW TABLE STATS [database_name.]table_name
    SHOW COLUMN STATS [database_name.]table_name

    For more information about running SHOW commands, see SHOW Statement.

  • Table and column statistics are also recorded in the query profile as shown below:

  • Do not compute statistics for tables that are not used in joins or for Impala queries. For example, statistics on a staging table that feeds into a data warehouse layer table one-to-one with no joins does not require statistics.
  • Limit running COMPUTE STATS only to columns that are involved in filters, join conditions, or GROUP BY clauses. For example, running COMPUTE STATS in the following situation makes sense:

    COMPUTE STATS wide_table [ join_column_a, join_column_b ]
  • Rerun COMPUTE STATS only when there is over 30% data change in the data.
  • Cloudera recommends that you run COMPUTE STATS in off-peak hours, on weekends, or at night.
  • When you reload a complete table or partition, where the number of rows and distinct values for each column is relatively unchanged from before, there is no need to rerun COMPUTE STATS.

Also see Table and Column Statistics for more information about table and column statistics.