Apache Hive Performance Tuning

Set up the cost-based optimizer and statistics

You can use the cost-based optimizer (CBO) and statistics to generate efficient query execution plans that can improve performance. You must generate column statistics to make CBO functional.

In this task, you enable and configure the cost-based optimizer (CBO) and configure Hive to gather column and table statistics for evaluating query performance. Column and table statistics are critical for estimating predicate selectivity and cost of the plan. Certain advanced rewrites require column statistics.
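
As a minimal sketch of generating the column statistics mentioned above, the following HiveQL statements could be run from Beeline or another Hive client. The table name customers and the column name customer_id are hypothetical and stand in for your own objects.

```sql
-- Hypothetical table name; substitute one of your own tables.
-- Computes column-level statistics for all columns of the table.
ANALYZE TABLE customers COMPUTE STATISTICS FOR COLUMNS;

-- Display the statistics recorded for a single (hypothetical) column.
DESCRIBE FORMATTED customers customer_id;
```

On large tables the ANALYZE statement launches a job that scans the data, so it is usually worth scheduling it outside peak hours.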

You check and, if necessary, set the following properties in the hive-site.xml configuration file:

  • hive.stats.autogather

    Controls collection of table-level statistics.

  • hive.stats.fetch.column.stats

    Controls whether Hive fetches column-level statistics from the metastore when it plans queries.

  • hive.compute.query.using.stats

    Instructs Hive to use statistics when generating query plans.

All of these properties are enabled by default. You can manually generate table-level statistics for newly created tables and table partitions by using the ANALYZE TABLE statement.
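
The manual approach might look like the following sketch. The table names customers and sales, and the partition column ds with its value, are assumptions made for illustration only.

```sql
-- Table-level statistics for a hypothetical unpartitioned table.
ANALYZE TABLE customers COMPUTE STATISTICS;

-- Table-level statistics for one partition of a hypothetical
-- table that is partitioned by a string column named ds.
ANALYZE TABLE sales PARTITION (ds='2024-01-01') COMPUTE STATISTICS;

-- Review the recorded table parameters, such as numRows and totalSize.
DESCRIBE FORMATTED customers;
```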

Before you begin, make sure that you meet the following prerequisites:

  • You installed Ambari.
  • You added the Apache Hive service and started all components.
  • You have administrative privileges to configure Hive in Ambari.

  1. In Ambari, select Services > Hive > Configs.
  2. Enable cost-based optimization if you changed the default: In Filter, enter hive.cbo.enable, and check the checkbox.
  3. Configure automatic gathering of table-level statistics for newly created tables and table partitions if you changed the default: In Filter, enter hive.stats.autogather, and check the checkbox.
  4. Configure Hive to use statistics when generating query plans: In Filter, enter hive.compute.query.using.stats, and check the checkbox.
  5. Restart Hive and any other affected services.
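
After the restart, one way to confirm the settings is to print them from a Hive session, as in the sketch below. The query, the table name customers, and the column name country are hypothetical, and the exact EXPLAIN output depends on your Hive version and execution engine.

```sql
-- Print the current value of each property in this session.
SET hive.cbo.enable;
SET hive.stats.autogather;
SET hive.stats.fetch.column.stats;
SET hive.compute.query.using.stats;

-- Show the execution plan for a query against a hypothetical table.
EXPLAIN SELECT COUNT(*) FROM customers WHERE country = 'DE';
```

Each SET statement without a value returns the property name and its current setting, which lets you check the configuration without opening Ambari.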