Chapter 5. Using the Cost-Based Optimizer for Optimal Performance
Hive’s Cost-Based Optimizer (CBO) is a core component in Hive’s query processing engine. Powered by Apache Calcite, the CBO optimizes and calculates the cost of various plans for a query.
The main goal of a CBO is to generate efficient execution plans by examining the tables and conditions specified in the query, ultimately cutting down on query execution time and reducing resource utilization. After parsing, query gets converted to a logical tree (Abstract Syntax Tree) that represents the operations that the query must perform, such as reading a particular table or performing an inner JOIN.
Calcite applies various optimizations such as query rewrite, JOIN reordering, deriving implied predicates and JOIN elimination to produce logically equivalent plans. The current model prefers bushy plans for maximum parallelism. Each logical plan is assigned a cost based in number of distinct value based heuristics.
Calcite has an efficient plan pruner that can select the cheapest query plan. The chosen logical plan is then converted by Hive to a physical operator tree, optimized and converted to Tez jobs, then executed on the Hadoop cluster.
Statistics
Column and table statistics must be calculated for optimal Hive performance because they are critical for estimating predicate selectivity and cost of the plan. In the absence of table statistics, Hive CBO is turned off. Certain advanced rewrites require column statistics. For best results, both types of statistics are recommended.
Important | |
---|---|
If table statistics are not generated, Hive CBO is turned off. |
Tip | |
---|---|
Enable both column and table statistics for best query performance. |
Table 5.1. Commands for Gathering Column and Table Statistics
Purpose | Command |
---|---|
Gathering table statistics for non-partitioned tables | ANALYZE TABLE [ |
Gathering table statistics for partitioned tables | ANALYZE TABLE [ |
Gathering column statistics | ANALYZE TABLE [ |
Gathering statistics for newly added | ANALYZE TABLE partition2 (col1="x") COMPUTE STATISTICS for COLUMNS; |
SQL Optimization and Planning Properties
Ambari has a configuration wizard that automatically tunes some of the optimization- and planner-related configuration properties of Hive, Tez, and YARN.
Tip | |
---|---|
In most cases, do not change the settings for properties that have Auto-tuned default settings listed in the following table. The values that are set for these properties are calculated by your cluster profile and rarely need to be overwritten. |
Table 5.2. Settings for Optimization and Planning Properties
Property | Setting Guideline If Manual Configuration Is Needed | Default Value in Ambari |
---|---|---|
| one-third of | Auto-tuned: Depends on environment |
|
|
|
| Production Systems: 4 to 8 GB Small VMs: 1 to 2 GB | Auto-tuned: Depends on environment |
|
| Auto-tuned: Depends on environment |
| Decrease for better latency Increase for more throughput |
|
| Decrease for better latency Increase for more throughput |
|
| Increase to launch more containers Decrease to enhance multitenancy |
|
| 1 GB is usually sufficient | Auto-tuned: Depends on environment |