Apache Hive Performance Tuning
Also available as:
PDF

Chapter 5. Using the Cost-Based Optimizer for Optimal Performance

Hive’s Cost-Based Optimizer (CBO) is a core component in Hive’s query processing engine. Powered by Apache Calcite, the CBO optimizes and calculates the cost of various plans for a query.

The main goal of a CBO is to generate efficient execution plans by examining the tables and conditions specified in the query, ultimately cutting down on query execution time and reducing resource utilization. After parsing, query gets converted to a logical tree (Abstract Syntax Tree) that represents the operations that the query must perform, such as reading a particular table or performing an inner JOIN.

Calcite applies various optimizations such as query rewrite, JOIN reordering, deriving implied predicates and JOIN elimination to produce logically equivalent plans. The current model prefers bushy plans for maximum parallelism. Each logical plan is assigned a cost based in number of distinct value based heuristics.

Calcite has an efficient plan pruner that can select the cheapest query plan. The chosen logical plan is then converted by Hive to a physical operator tree, optimized and converted to Tez jobs, then executed on the Hadoop cluster.

Statistics

Column and table statistics must be calculated for optimal Hive performance because they are critical for estimating predicate selectivity and cost of the plan. In the absence of table statistics, Hive CBO is turned off. Certain advanced rewrites require column statistics. For best results, both types of statistics are recommended.

[Important]Important

If table statistics are not generated, Hive CBO is turned off.

[Tip]Tip

Enable both column and table statistics for best query performance.

Table 5.1. Commands for Gathering Column and Table Statistics

PurposeCommand

Gathering table statistics for non-partitioned tables

ANALYZE TABLE [table_name] COMPUTE STATISTICS;

Gathering table statistics for partitioned tables

ANALYZE TABLE [table_name] PARTITION(partition_column) COMPUTE STATISTICS;

Gathering column statistics

ANALYZE TABLE [table_name] COMPUTE STATISTICS for COLUMNS [comma_separated_column_list]; ANALYZE TABLE [table_name] PARTITION(partition_column) COMPUTE STATISTICS for COLUMNS [comma_separated_column_list];

Gathering statistics for newly added partition2 on a table partitioned on col1 with key x

ANALYZE TABLE partition2 (col1="x") COMPUTE STATISTICS for COLUMNS;


SQL Optimization and Planning Properties

Ambari has a configuration wizard that automatically tunes some of the optimization- and planner-related configuration properties of Hive, Tez, and YARN.

[Tip]Tip

In most cases, do not change the settings for properties that have Auto-tuned default settings listed in the following table. The values that are set for these properties are calculated by your cluster profile and rarely need to be overwritten.

Table 5.2. Settings for Optimization and Planning Properties

PropertySetting Guideline If Manual Configuration Is NeededDefault Value in Ambari

hive.auto.convert.join.

noconditionaltask.size

one-third of -Xmx value

Auto-tuned: Depends on environment

hive.cbo.enable

true

true

hive.tez.container.size

Production Systems: 4 to 8 GB

Small VMs: 1 to 2 GB

Auto-tuned: Depends on environment

hive.tez.java.opts

-Xmx value must be 80% to 90% of container size

Auto-tuned: Depends on environment

tez.grouping.min.size

Decrease for better latency

Increase for more throughput

16777216

tez.grouping.max.size

Decrease for better latency

Increase for more throughput

1073741824

tez.grouping.split-waves

Increase to launch more containers

Decrease to enhance multitenancy

1.7

yarn.scheduler.minimum-allocation-mb

1 GB is usually sufficient

Auto-tuned: Depends on environment