Hive Table Statistics

Minimum Required Role: Cluster Administrator (also provided by Full Administrator)

If your cluster has Impala then you can use the Impala implementation to compute statistics. The Impala implementation to compute table statistics is available in CDH 5.0.0 or higher and in Impala version 1.2.2 or higher. The Impala implementation of COMPUTE STATS requires no setup steps and is preferred over the Hive implementation. See Overview of Table Statistics. If you are running an older version of Impala, you can collect statistics on a Hive table by running the following command from a Beeline client connected to HiveServer2:
analyze table <table name> compute statistics;
analyze table <table name> compute statistics for columns <all columns of a table>;

Configuring Hive to Store Statistics in MySQL

If you are deploying CDH 5.2 this procedure is no longer is required because Hive has another implementation of the statistics calculator.

By default, Hive writes statistics to a Derby database backed by a file named /var/lib/hive/TempStatsStore. However, in production systems Cloudera recommends that you store statistics in a database. Hive table statistics are not supported for PostgreSQL or Oracle. To configure Hive to store statistics in MySQL:
  1. Set up a MySQL server. For instructions on setting up MySQL, see MySQL Database.

    This database will be heavily loaded, so it should not be installed on the same host as anything critical such as the Hive Metastore Server, the database hosting the Hive Metastore, or Cloudera Manager Server. When collecting statistics on a large table and/or in a large cluster, this host may become slow or unresponsive.

  2. Create a statistics database in MySQL:
    mysql> create database stats_db_name DEFAULT CHARACTER SET utf8;
    Query OK, 1 row affected (0.00 sec)
    
    mysql> grant all on stats_db_name.* TO 'stats_user'@'%' IDENTIFIED BY 'stats_password';
    Query OK, 0 rows affected (0.00 sec)
  3. Add the following into the HiveServer2 Configuration Advanced Configuration Snippet for hive-site.xml property:
    <property>
      <name>hive.stats.dbclass</name>
      <value>jdbc:mysql</value>
    </property>
    <property>
      <name>hive.stats.jdbcdriver</name>
      <value>com.mysql.jdbc.Driver</value>
    </property>
    <property>
      <name>hive.stats.dbconnectionstring</name>
      <value>jdbc:mysql://<stats_mysql_host>:3306/<stats_db_name>
             ?useUnicode=true&amp;characterEncoding=UTF-8&amp;
             user=<stats_user>&amp;password=<stats_password></value>
    </property>
    <property> 
      <name>hive.aux.jars.path</name> 
      <value>file:///usr/share/java/mysql-connector-java.jar</value>
    </property>
  4. Click Save Changes to commit the changes.
  5. Restart the HiveServer2 role.