Statistics generation and viewing commands in Cloudera Data Warehouse

You can manually generate table and column statistics, and then view statistics using Hive queries. By default, Hive generates table statistics, but not column statistics, which you must generate manually to make cost-based optimization (CBO) functional.

Commands for generating statistics

The following ANALYZE TABLE command generates statistics for tables and columns:

ANALYZE TABLE [table_name] COMPUTE STATISTICS;
Gathers table statistics for non-partitioned tables.
ANALYZE TABLE [table_name] PARTITION(partition_column) COMPUTE STATISTICS;
Gathers table statistics for partitioned tables.
ANALYZE TABLE [table_name] COMPUTE STATISTICS for COLUMNS [comma_separated_column_list];
Gathers column statistics for the entire table.
ANALYZE TABLE partition2 (col1="x") COMPUTE STATISTICS for COLUMNS;
Gathers statistics for the partition2 column on a table partitioned on col1 with key x.

Commands for viewing statistics

You can use the following commands to view table and column statistics:

DESCRIBE [EXTENDED] table_name;
View table statistics. The EXTENDED keyword can be used only if the hive.stats.autogather property is enabled in the hive-site.xml configuration file. Use the Cloudera Manager Safety Valve feature (see link below).
DESCRIBE FORMATTED [db_name.]table_name [column_name] [PARTITION (partition_spec)];
View column statistics.

Console warnings for small files

The MSCK and ANALYZE commands display a warning message in the console if the average file size for a table or partition is below the threshold. The average file size is calculated by dividing the total size by the number of files (totalSize / numFiles). If this value is less than the avgFileSize threshold, a warning is displayed so you can identify potential performance issues caused by small files.

Hive displays small-file warnings during MSCK REPAIR TABLE or ANALYZE TABLE operations to help you identify performance risks associated with small files.

The following example shows the warning message when you run the MSCK REPAIR TABLE command:

hive> MSCK REPAIR TABLE sales;
Partitions not in metastore: sales/dt=2025-01-01, sales/dt=2025-01-02
Repair: Added partition to metastore sales/dt=2025-01-01
Repair: Added partition to metastore sales/dt=2025-01-02
[MSCK] Small files detected.
[MSCK] Average file size is too small, small files exist.
Partition name: dt=2025-01-01. Small files detected: partition dt=2025-01-01 (avgBytes=2048, files=5000, totalBytes=10240000)
[MSCK] Average file size is too small, small files exist.
Partition name: dt=2025-01-02. Small files detected: partition dt=2025-01-02 (avgBytes=1024, files=8000, totalBytes=8192000)
OK

The following example shows the warning message when you run the ANALYZE TABLE command:

hive> ANALYZE TABLE sales PARTITION(dt='2025-01-01') COMPUTE STATISTICS;
...
[ANALYZE] Small files detected: partition dt=2025-01-01 (avgBytes=2048, files=5000, totalBytes=10240000)
OK