Apache Hive Performance Tuning
Also available as:
PDF

Use Column Statistics and the Cost-Based Optimizer (CBO)

A CBO generates more efficient query plans. In Hive, the CBO is enabled by default, but it requires that column statistics be generated for tables. Column statistics can be expensive to compute so they are not automated. Hive has a CBO that is based on Apache Calcite and an older physical optimizer. All of the optimizations are being migrated to the CBO. The physical optimizer performs better with statistics, but the CBO requires statistics.

Enabling the CBO

The CBO is enabled by default in Hive 0.14 and later. If you need to enable it manually, set the following property in hive-site.xml:

SET hive.cbo.enable=true;

For the physical optimizer, set the following properties in hive-site.xml to generate statistics:

SET hive.stats.fetch.column.stats=true;
SET hive.stats.fetch.partition.stats=true;

Gathering Statistics--Critical to the CBO

The CBO requires both table-level and column-level statistics:

  • Table-level statistics:

    Table-level statistics should always be collected. Make sure the following property is set as follows in hive-site.xml to collect table-level statistics:

    SET hive.stats.autogather=true;

    If you have an existing table that does not have statistics collected, you can collect them by running the following query:

    ANALYZE TABLE <table_name> COMPUTE STATISTICS;
  • Column-level statistics (critical):

    Column-level statistics are expensive to compute and are not yet automated. The recommended process to use for Hive 0.14 and later is to compute column statistics for all of your existing tables using the following command:

    ANALYZE TABLE <table_name> COMPUTE STATISTICS for COLUMNS;

    As new partitions are added to the table, if the table is partitioned on "col1" and the new partition has the key "x," then you must also use the following command:

    ANALYZE TABLE <table_name> partition (coll="x") COMPUTE STATISTICS for COLUMNS;