SQL Optimization
Cost-based optimization (CBO) of SQL queries is supported in Hive 0.13.0 and later. CBO uses Hive table, table partition, and column statistics to create efficient query execution plans. Efficient query plans better utilize cluster resources and improve query latency. CBO is most useful for complex queries that contain multiple JOIN statements and for queries on very large tables.
Note | |
---|---|
Tables are not required to have partitions to generate CBO statistics. Column-level CBO statistics can be generated by both partitioned and unpartitioned tables. |
CBO generates the following statistics:
Statistics Granularity | Description |
---|---|
Table-level | - Uncompressed size of table - Number of rows - Number of files |
Column-level | - Number of distinct values - Number of NULL values - Minimum value - Maximum value |
CBO requires column-level statistics to generate the best query execution plans.
Later, when viewing these statistics from the command line, you can choose to also include
table-level statistics that are generated by the hive.stats.autogather
configuration
property. However, CBO does not use these table-level statistics to generate query execution
plans.