3. New Feature: Cost-based SQL Optimization

Hive 0.13.0 introduced cost-based optimization, or CBO, of SQL queries. CBO uses statistics about Hive tables, table partitions, and columns within a table to produce good query execution plans. More efficient query plans better utilize cluster resources and improve query latency. CBO is most useful for complex queries containing multiple JOIN statements and for queries on very large tables.


Tables are not required to have partitions to generate CBO statistics. Column-level CBO statistics can be generated by both partitioned and unpartitioned tables.

CBO currently generates the following statistics:

Statistics GranularityDescription

- Uncompressed size of table

- Number of rows

- Number of files


- Number of distinct values

- Number of NULL values

- Minimum value

- Maximum value

CBO requires column-level statistics to generate the best query execution plans. Later, when viewing these statistics from the command line, you can can choose to also include table-level statistics generated by the hive.stats.autogather configuration property. However, CBO does not use these table-level statistics to generate query execution plans.


See Statistics in Hive for more information.

Enabling Cost-based SQL Optimization

Hortonworks recommends that administrators always enable CBO. Set and verify the following configuration parameters in hive-site.xml to enable cost-based optimization of SQL queries:


Table 1.1. CBO Configuration Parameters

CBO Configuration ParameterDescriptionDefault Value


Enables cost-based query optimization.



Enables automated gathering of table-level statistics for newly created tables and table partitions, such as tables created with the INSERT OVERWRITE statement. The parameter does not produce column-level statistics, such as those generated by CBO. If disabled, administrators must manually generate these table-level statistics with the ANALYZE TABLE statement.


The following configuration properties are not specific to CBO, but setting them to true will also improve the performance of queries that generate statistics:

Configuration ParameterDescriptionDefault Value

hive.stats.fetch. column.stats

Instructs Hive to collect column-level statistics.


hive.compute.query. using.stats

Instructs Hive to use statistics when generating query plans.


Generating Statistics

Use the ANALYZE TABLE command to generate statistics for tables and columns. Use the optional NoScan clause to improve query performance by preventing a scan of files on HDFS. This option gathers only the following statistics:

  • Number of files

  • Size of files in bytes

ANALYZE TABLE tablename [PARTITION(partcol1[=val1], partcol2[=val2], ...)] COMPUTE STATISTICS [NoScan];

The following example views statistics for all partitions in the employees table. The query also uses the NoScan clause to improve performance:


Generating Column-level Statistics:

Use the following syntax to generate statistics for columns in the employee table:

ANALYZE TABLE tablename [PARTITION(partcol1[1=val1], partcol2[=val2], ...)] COMPUTE STATISTICS FOR COLUMNS [NoScan]; 

The following example generates statistics for all columns in the employees table:


Viewing Statistics

Use the DESCRIBE statement to view statistics generated by CBO. Include the EXTENDED keyword if you want to include statistics gathered when the hive.stats.fetch.column.stats and hive.compute.query.using.stats properties are enabled.

Viewing Generated Table Statistics

Use the following syntax to generate table statistics:


The EXTENDED keyword can be used only if the hive.stats.autogather property is enabled in the hive-site.xml configuration file.

The following example displays all statistics for the employees table:


Viewing Generated Column Statistics

Use the following syntax to generate column statistics:

DESCRIBE FORMATTED [dbname.]tablename.columnname; 

The following example displays statistics for the region column in the employees table:

DESCRIBE FORMATTED employees.region;