Chapter 5. Using the Cost-Based Optimizer to Enhance Performance

Hive’s Cost-Based Optimizer (CBO) is a core component in Hive’s query processing engine. Powered by Apache Calcite, the CBO optimizes and calculates the cost of various plans for a query.

The main goal of a CBO is to generate efficient execution plans by examining the tables and conditions specified in the query, ultimately cutting down on query execution time and reducing resource utilization. After parsing, a query gets converted to a logical tree (Abstract Syntax Tree) that represents the operations that the query must perform, such as reading a particular table or performing an inner JOIN.

Calcite applies various optimizations such as query rewrite, JOIN reordering, and deriving implied predicates and JOIN elimination to produce logically equivalent plans. The current model prefers bushy plans for maximum parallelism. Each logical plan is assigned a cost based in number of distinct value based heuristics.

Calcite has an efficient plan pruner that can select the cheapest query plan. The chosen logical plan is then converted by Hive to a physical operator tree, optimized and converted to Tez jobs, and then executed on the Hadoop cluster.

Enabling Cost-Based Optimization

About this Task

Turning on CBO is highly recommended.

Prerequisite

You must have administrator previleges.

Steps

In Ambari, open Services > Hive > Configs tab.
Refer to the following table for the properties that enable CBO and assist with generating table statistics, along with the required property settings.
You can view the properties by either of these methods:
Type each property name in the Filter field in the top right corner.
Open the General, Advanced hive-env, etc., sections and scan the lists of each category.
Click Save.
If prompted to restart, restart the Hive Service.

Table 5.1. CBO Configuration Parameters

Configuration Parameter	Setting to Enable CBO	Description
`hive.cbo.enable`	`true`	Enables cost-based query optimization.
`hive.stats.autogather`	`true`	Enables automated gathering of table-level statistics for newly created tables and table partitions, such as tables created with the INSERT OVERWRITE statement. The parameter does not produce column-level statistics, such as those generated by CBO. If disabled, administrators must manually generate the table-level statistics for newly generated tables and table partitions with the ANALYZE TABLE statement.

Statistics

	Tip
	Gather both column and table statistics for best query performance.

Column and table statistics must be calculated for optimal Hive performance because they are critical for estimating predicate selectivity and cost of the plan. In the absence of table statistics, Hive CBO does not function. Certain advanced rewrites require column statistics.

Ensure that the configuration properties in the following table are set to true to improve the performance of queries that generate statistics. You can set the properties using Ambari or by customizing the hive-site.xml file.

Configuration Parameter	Setting to Enable Statistics	Description
`hive.stats.fetch.column.stats`	`true`	Instructs Hive to collect column-level statistics.
`hive.compute.query.using.stats`	`true`	Instructs Hive to use statistics when generating query plans.

Generating Hive Statistics

The ANALYZE TABLE command generates statistics for tables and columns. The following lines show how to generate different types of statistics on Hive objects.

Gathering table statistics for non-partitioned tables

ANALYZE TABLE [table_name] COMPUTE STATISTICS;

Gathering table statistics for partitioned tables

ANALYZE TABLE [table_name] PARTITION(partition_column) COMPUTE STATISTICS;

Gathering column statistics for the entire table

ANALYZE TABLE [table_name] COMPUTE STATISTICS for COLUMNS [comma_separated_column_list];

Gathering statistics for the partition2 column on a table partitioned on col1 with key x

ANALYZE TABLE partition2 (col1="x") COMPUTE STATISTICS for COLUMNS;

Viewing Generated Statistics

Use the DESCRIBE statement to view statistics generated by CBO. Include the EXTENDED keyword if you want to include statistics gathered when the hive.stats.fetch.column.stats and hive.compute.query.using.stats properties are enabled.

Viewing Table Statistics
- Use the following syntax to view table statistics:
```
DESCRIBE [EXTENDED] table_name; 
```
  Note
  The EXTENDED keyword can be used only if the hive.stats.autogather property is enabled in the hive-site.xml configuration file.
- The following example displays all statistics for the employees table:
```
DESCRIBE EXTENDED employees; 
```
- If the table statistics are up-to-date, the output includes the following table parameter information:
```
{\"BASIC_STATS\":\"true\",\" ...
```
Viewing Column Statistics
- Use the following syntax to view column statistics:
```
DESCRIBE FORMATTED [db_name.]table_name.column_name; 
```
- The following example displays statistics for the region column in the employees table:
```
DESCRIBE FORMATTED employees.region; 
```
- If the table statistics are up-to-date, the output includes the following table parameter information:
```
COLUMN_STATS_ACCURATE 
```

	Note
	See Statistics in Hive on the Apache website for more information.

SQL Optimization and Planning Properties

Ambari has a configuration wizard that automatically tunes some of the optimization- and planner-related configuration properties of Hive, Tez, and YARN.

	Tip
	In most cases, do not change the settings for properties that have Auto-tuned default settings listed in the following table. The values that are set for these properties are calculated by your cluster profile and rarely need to be overwritten.

Table 5.2. Settings for Optimization and Planning Properties

Property	Setting Guideline If Manual Configuration Is Needed	Default Value in Ambari
`hive.auto.convert.join.` `noconditionaltask.size`	one-third of `-Xmx` value	Auto-tuned: Depends on environment
`hive.tez.container.size`	Production Systems: 4 to 8 GB Small VMs: 1 to 2 GB	Auto-tuned: Depends on environment
`hive.tez.java.opts`	`-Xmx` value must be 80% to 90% of container size	Auto-tuned: Depends on environment
`tez.grouping.min.size`	Decrease for better latency Increase for more throughput	`16777216`
`tez.grouping.max.size`	Decrease for better latency Increase for more throughput	`1073741824`
`tez.grouping.split-waves`	Increase to launch more containers Decrease to enhance multitenancy	`1.7`
`yarn.scheduler.minimum-allocation-mb`	1 GB is usually sufficient	Auto-tuned: Depends on environment

	Note
	The EXTENDED keyword can be used only if the `hive.stats.autogather` property is enabled in the `hive-site.xml` configuration file.