Chapter 5. Using the Cost-Based Optimizer to Enhance Performance
Hive’s Cost-Based Optimizer (CBO) is a core component in Hive’s query processing engine. Powered by Apache Calcite, the CBO optimizes and calculates the cost of various plans for a query.
The main goal of a CBO is to generate efficient execution plans by examining the tables and conditions specified in the query, ultimately cutting down on query execution time and reducing resource utilization. After parsing, a query gets converted to a logical tree (Abstract Syntax Tree) that represents the operations that the query must perform, such as reading a particular table or performing an inner JOIN.
Calcite applies various optimizations such as query rewrite, JOIN reordering, and deriving implied predicates and JOIN elimination to produce logically equivalent plans. The current model prefers bushy plans for maximum parallelism. Each logical plan is assigned a cost based in number of distinct value based heuristics.
Calcite has an efficient plan pruner that can select the cheapest query plan. The chosen logical plan is then converted by Hive to a physical operator tree, optimized and converted to Tez jobs, and then executed on the Hadoop cluster.
Enabling Cost-Based Optimization
About this Task
Turning on CBO is highly recommended.
Prerequisite
You must have administrator previleges.
Steps
In Ambari, open
Services
>Hive
> Configs tab.Refer to the following table for the properties that enable CBO and assist with generating table statistics, along with the required property settings.
You can view the properties by either of these methods:
Type each property name in the Filter field in the top right corner. Open the General, Advanced hive-env, etc., sections and scan the lists of each category. Click
.If prompted to restart, restart the Hive Service.
Table 5.1. CBO Configuration Parameters
Configuration Parameter | Setting to Enable CBO | Description |
---|---|---|
| true |
Enables cost-based query optimization. |
| true |
Enables automated gathering of table-level statistics for newly created tables and table partitions, such as tables created with the INSERT OVERWRITE statement. The parameter does not produce column-level statistics, such as those generated by CBO. If disabled, administrators must manually generate the table-level statistics for newly generated tables and table partitions with the ANALYZE TABLE statement. |
Statistics
Tip | |
---|---|
Gather both column and table statistics for best query performance. |
Column and table statistics must be calculated for optimal Hive performance because they are critical for estimating predicate selectivity and cost of the plan. In the absence of table statistics, Hive CBO does not function. Certain advanced rewrites require column statistics.
Ensure that the configuration properties in the following table are set to true
to
improve the performance of queries that generate statistics. You can set the properties using
Ambari or by customizing the hive-site.xml
file.
Configuration Parameter | Setting to Enable Statistics | Description |
---|---|---|
| true |
Instructs Hive to collect column-level statistics. |
| true |
Instructs Hive to use statistics when generating query plans. |
Generating Hive Statistics
The ANALYZE TABLE command generates statistics for tables and columns. The following lines show how to generate different types of statistics on Hive objects.
Gathering table statistics for non-partitioned tables
ANALYZE TABLE [table_name
] COMPUTE STATISTICS;
Gathering table statistics for partitioned tables
ANALYZE TABLE [table_name
] PARTITION(partition_column
) COMPUTE STATISTICS;
Gathering column statistics for the entire table
ANALYZE TABLE [table_name
] COMPUTE STATISTICS for COLUMNS [comma_separated_column_list
];
Gathering statistics for the partition2
column on a table partitioned
on col1
with key x
ANALYZE TABLE partition2 (col1="x") COMPUTE STATISTICS for COLUMNS;
Viewing Generated Statistics
Use the DESCRIBE statement to view statistics generated by CBO. Include the EXTENDED keyword if you
want to include statistics gathered when the hive.stats.fetch.column.stats
and
hive.compute.query.using.stats
properties are enabled.
Viewing Table Statistics
Use the following syntax to view table statistics:
DESCRIBE [EXTENDED]
table_name
;Note The EXTENDED keyword can be used only if the
hive.stats.autogather
property is enabled in thehive-site.xml
configuration file.The following example displays all statistics for the employees table:
DESCRIBE EXTENDED employees;
If the table statistics are up-to-date, the output includes the following table parameter information:
{\"BASIC_STATS\":\"true\",\" ...
Viewing Column Statistics
Use the following syntax to view column statistics:
DESCRIBE FORMATTED [
db_name
.]table_name
.column_name
;The following example displays statistics for the region column in the employees table:
DESCRIBE FORMATTED employees.region;
If the table statistics are up-to-date, the output includes the following table parameter information:
COLUMN_STATS_ACCURATE
Note | |
---|---|
See Statistics in Hive on the Apache website for more information. |
SQL Optimization and Planning Properties
Ambari has a configuration wizard that automatically tunes some of the optimization- and planner-related configuration properties of Hive, Tez, and YARN.
Tip | |
---|---|
In most cases, do not change the settings for properties that have Auto-tuned default settings listed in the following table. The values that are set for these properties are calculated by your cluster profile and rarely need to be overwritten. |
Table 5.2. Settings for Optimization and Planning Properties
Property | Setting Guideline If Manual Configuration Is Needed | Default Value in Ambari |
---|---|---|
| one-third of | Auto-tuned: Depends on environment |
| Production Systems: 4 to 8 GB Small VMs: 1 to 2 GB | Auto-tuned: Depends on environment |
|
| Auto-tuned: Depends on environment |
| Decrease for better latency Increase for more throughput |
|
| Decrease for better latency Increase for more throughput |
|
| Increase to launch more containers Decrease to enhance multitenancy |
|
| 1 GB is usually sufficient | Auto-tuned: Depends on environment |