Hive 0.13.0 introduced cost-based optimization, or CBO, of SQL queries. CBO uses statistics about Hive tables, table partitions, and columns within a table to produce good query execution plans. More efficient query plans better utilize cluster resources and improve query latency. CBO is most useful for complex queries containing multiple JOIN statements and for queries on very large tables.
Note | |
---|---|
Tables are not required to have partitions to generate CBO statistics. Column-level CBO statistics can be generated by both partitioned and unpartitioned tables. |
CBO currently generates the following statistics:
Statistics Granularity | Description |
---|---|
Table-level | - Uncompressed size of table - Number of rows - Number of files |
Column-level | - Number of distinct values - Number of NULL values - Minimum value - Maximum value |
CBO requires column-level statistics to generate the best query execution plans. Later, when viewing these statistics from the command line, you can can choose to also include table-level statistics generated by the hive.stats.autogather
configuration property. However, CBO does not use these table-level statistics to generate query execution plans.
Note | |
---|---|
See Statistics in Hive for more information. |
Enabling Cost-based SQL Optimization
Hortonworks recommends that administrators always enable CBO. Set and verify the following configuration parameters in hive-site.xml to enable cost-based optimization of SQL queries:
Table 1.1. CBO Configuration Parameters
CBO Configuration Parameter | Description | Default Value |
---|---|---|
| Enables cost-based query optimization. | False |
| Enables automated gathering of table-level statistics for newly created tables and table partitions, such as tables created with the INSERT OVERWRITE statement. The parameter does not produce column-level statistics, such as those generated by CBO. If disabled, administrators must manually generate these table-level statistics with the ANALYZE TABLE statement. | True |
The following configuration properties are not specific to CBO, but setting them to true
will also improve the performance of queries that generate statistics:
Configuration Parameter | Description | Default Value |
---|---|---|
| Instructs Hive to collect column-level statistics. | False |
| Instructs Hive to use statistics when generating query plans. | False |
Generating Statistics
Use the ANALYZE TABLE command to generate statistics for tables and columns. Use the optional NoScan clause to improve query performance by preventing a scan of files on HDFS. This option gathers only the following statistics:
Number of files
Size of files in bytes
ANALYZE TABLE tablename [PARTITION(partcol1[=val1], partcol2[=val2], ...)] COMPUTE STATISTICS [NoScan];
The following example views statistics for all partitions in the employees table. The query also uses the NoScan clause to improve performance:
ANALYZE TABLE employees PARTITION (dt) COMPUTE STATISTICS [NoScan];
Generating Column-level Statistics:
Use the following syntax to generate statistics for columns in the employee table:
ANALYZE TABLE tablename [PARTITION(partcol1[1=val1], partcol2[=val2], ...)] COMPUTE STATISTICS FOR COLUMNS [NoScan];
The following example generates statistics for all columns in the employees table:
ANALYZE TABLE employees PARTITION (dt) COMPUTE STATISTICS FOR COLUMNS;
Viewing Statistics
Use the DESCRIBE statement to view statistics generated by CBO. Include the EXTENDED keyword if you
want to include statistics gathered when the hive.stats.fetch.column.stats
and
hive.compute.query.using.stats
properties are enabled.
Viewing Generated Table Statistics
Use the following syntax to generate table statistics:
DESCRIBE [EXTENDED] tablename;
Note | |
---|---|
The EXTENDED keyword can be used only if the |
The following example displays all statistics for the employees table:
DESCRIBE EXTENDED employees;
Viewing Generated Column Statistics
Use the following syntax to generate column statistics:
DESCRIBE FORMATTED [dbname.]tablename.columnname;
The following example displays statistics for the region column in the employees table:
DESCRIBE FORMATTED employees.region;