Chapter 5. Using the Cost-Based Optimizer for Optimal Performance

Hive’s Cost-Based Optimizer (CBO) is a core component in Hive’s query processing engine. Powered by Apache Calcite, the CBO optimizes and calculates the cost of various plans for a query.

The main goal of a CBO is to generate efficient execution plans by examining the tables and conditions specified in the query, ultimately cutting down on query execution time and reducing resource utilization. After parsing, query gets converted to a logical tree (Abstract Syntax Tree) that represents the operations that the query must perform, such as reading a particular table or performing an inner JOIN.

Calcite applies various optimizations such as query rewrite, JOIN reordering, deriving implied predicates and JOIN elimination to produce logically equivalent plans. The current model prefers bushy plans for maximum parallelism. Each logical plan is assigned a cost based in number of distinct value based heuristics.

Calcite has an efficient plan pruner that can select the cheapest query plan. The chosen logical plan is then converted by Hive to a physical operator tree, optimized and converted to Tez jobs, then executed on the Hadoop cluster.

Statistics

Column and table statistics must be calculated for optimal Hive performance because they are critical for estimating predicate selectivity and cost of the plan. In the absence of table statistics, Hive CBO is turned off. Certain advanced rewrites require column statistics. For best results, both types of statistics are recommended.

	Important
	If table statistics are not generated, Hive CBO is turned off.

	Tip
	Enable both column and table statistics for best query performance.

Table 5.1. Commands for Gathering Column and Table Statistics

Purpose	Command
Gathering table statistics for non-partitioned tables	*ANALYZE TABLE [`table_name`] COMPUTE STATISTICS;*
Gathering table statistics for partitioned tables	*ANALYZE TABLE [`table_name`] PARTITION(`partition_column`) COMPUTE STATISTICS;*
Gathering column statistics	ANALYZE TABLE [`table_name`] COMPUTE STATISTICS for COLUMNS [`comma_separated_column_list`]; ANALYZE TABLE [`table_name`] PARTITION(`partition_column`) COMPUTE STATISTICS for COLUMNS [`comma_separated_column_list`];
Gathering statistics for newly added `partition2` on a table partitioned on `col1` with key `x`	ANALYZE TABLE partition2 (col1="x") COMPUTE STATISTICS for COLUMNS;

SQL Optimization and Planning Properties

Ambari has a configuration wizard that automatically tunes some of the optimization- and planner-related configuration properties of Hive, Tez, and YARN.

	Tip
	In most cases, do not change the settings for properties that have Auto-tuned default settings listed in the following table. The values that are set for these properties are calculated by your cluster profile and rarely need to be overwritten.

Table 5.2. Settings for Optimization and Planning Properties

Property	Setting Guideline If Manual Configuration Is Needed	Default Value in Ambari
`hive.auto.convert.join.` `noconditionaltask.size`	one-third of `-Xmx` value	Auto-tuned: Depends on environment
`hive.cbo.enable`	`true`	`true`
`hive.tez.container.size`	Production Systems: 4 to 8 GB Small VMs: 1 to 2 GB	Auto-tuned: Depends on environment
`hive.tez.java.opts`	`-Xmx` value must be 80% to 90% of container size	Auto-tuned: Depends on environment
`tez.grouping.min.size`	Decrease for better latency Increase for more throughput	`16777216`
`tez.grouping.max.size`	Decrease for better latency Increase for more throughput	`1073741824`
`tez.grouping.split-waves`	Increase to launch more containers Decrease to enhance multitenancy	`1.7`
`yarn.scheduler.minimum-allocation-mb`	1 GB is usually sufficient	Auto-tuned: Depends on environment