3. New Feature: Cost-based SQL Optimization

Hive 0.13.0 introduced cost-based optimization, or CBO, of SQL queries. CBO uses statistics about Hive tables, table partitions, and columns within a table to produce good query execution plans. More efficient query plans better utilize cluster resources and improve query latency. CBO is most useful for complex queries containing multiple JOIN statements and for queries on very large tables.

[Note]Note

Tables are not required to have partitions to generate CBO statistics. Column-level CBO statistics can be generated by both partitioned and unpartitioned tables.

CBO currently generates the following statistics:

Statistics GranularityDescription
Table-level

- Uncompressed size of table

- Number of rows

- Number of files

Column-level

- Number of distinct values

- Number of NULL values

- Minimum value

- Maximum value

CBO requires column-level statistics to generate the best query execution plans. Later, when viewing these statistics from the command line, you can can choose to also include table-level statistics generated by the hive.stats.autogather configuration property. However, CBO does not use these table-level statistics to generate query execution plans.

[Note]Note

See Statistics in Hive for more information.

Enabling Cost-based SQL Optimization

Hortonworks recommends that administrators always enable CBO. Set and verify the following configuration parameters in hive-site.xml to enable cost-based optimization of SQL queries:

 

Table 1.1. CBO Configuration Parameters

CBO Configuration ParameterDescriptionDefault Value

hive.cbo.enable

Enables cost-based query optimization.

False

hive.stats.autogather

Enables automated gathering of table-level statistics for newly created tables and table partitions, such as tables created with the INSERT OVERWRITE statement. The parameter does not produce column-level statistics, such as those generated by CBO. If disabled, administrators must manually generate these table-level statistics with the ANALYZE TABLE statement.

True


The following configuration properties are not specific to CBO, but setting them to true will also improve the performance of queries that generate statistics:

Configuration ParameterDescriptionDefault Value

hive.stats.fetch. column.stats

Instructs Hive to collect column-level statistics.

False

hive.compute.query. using.stats

Instructs Hive to use statistics when generating query plans.

False

Generating Statistics

Use the ANALYZE TABLE command to generate statistics for tables and columns. Use the optional NoScan clause to improve query performance by preventing a scan of files on HDFS. This option gathers only the following statistics:

  • Number of files

  • Size of files in bytes

ANALYZE TABLE tablename [PARTITION(partcol1[=val1], partcol2[=val2], ...)] COMPUTE STATISTICS [NOSCAN];

The following example views statistics for all partitions in the employees table. The query also uses the NoScan clause to improve performance:

ANALYZE TABLE employees PARTITION (dt) COMPUTE STATISTICS 

Generating Column-level Statistics:

Use the following syntax to generate statistics for columns in the employee table:

ANALYZE TABLE tablename [PARTITION(partcol1[1=val1], partcol2[=val2], ...)] COMPUTE STATISTICS FOR COLUMNS [NOSCAN]; 

The following example generates statistics for all column in the employees table:

ANALYZE TABLE employees PARTITION (dt) COMPUTE STATISTICS FOR COLUMNS; 

Viewing Statistics

Use the DESCRIBE statement to view statistics generated by CBO. Include the EXTENDED keyword if you want to include statistics gathered when the hive.stats.fetch.column.stats and hive.compute.query.using.stats properties are enabled.

Viewing Generated Table Statistics

Use the following syntax to generate table statistics:

DESCRIBE [EXTENDED] tablename; 
[Note]Note

The EXTENDED keyword can be used only if the hive.stats.autogather property is enabled in the hive-site.xml configuration file.

The following example displays all statistics for the employees table:

DESCRIBE EXTENDED employees; 

Viewing Generated Column Statistics

Use the following syntax to generate column statistics:

DESCRIBE FORMATTED [dbname.]tablename.columnname; 

The following example displays statistics for the region column in the employees table:

DESCRIBE FORMATTED employees.region;