Setting up the cost-based optimizer and statistics

You can use the cost-based optimizer (CBO) and statistics to develop efficient query execution plans that can improve performance. You must generate column statistics to make CBO functional.

In this task, you enable and configure the cost-based optimizer (CBO) and configure Hive to gather column statistics as well as table statistics for evaluating query performance. Column and table statistics are critical for estimating predicate selectivity and the cost of the plan. Certain advanced rewrites require column statistics.

In this task, you check, and set the following properties:

  • hive.stats.autogather

    Controls collection of table-level statistics.

  • hive.stats.fetch.column.stats

    Controls collection of column-level statistics.

  • hive.compute.query.using.stats

    Instructs Hive to use statistics when generating query plans.

You can manually generate the table-level statistics for newly created tables and table partitions using the ANALYZE TABLE statement.

  • The following components are running:
    • HiveServer
    • Hive Metastore
    • Hive clients
  • Minimum Required Role: Configurator (also provided by Cluster Administrator, Full Administrator)
  1. In Cloudera Manager, click Clusters > Hive On Tez > Configuration.
  2. In the Configuration page, search for the hive.cbo.enable property.
    If the property is not visible in your version of Cloudera Manager, add the property to Hive site using the Cloudera Manager Safety Valve (see links below). Set the property to enabled.
  3. Accept the default (enabled), or check to enable the hive.cbo.enable property for the HiveServer Default Group.
  4. Search for and enable, if necessary, hive.stats.fetch.column.stats.
  5. Search for and enable, if necessary, hive.compute.query.using.stats.
  6. Click Actions > Restart to restart the Hive on Tez service.