Tuning the metastore

Generally, you need to limit concurrent connections to Hive metastore. As the number of open connections increases, so does latency. Issues with the backend database, improper Hive use, such as extremely complex queries, a connection leak, and other factors can affect performance.

General Metastore Tuning in CDP Data Center

Try making the following changes to tune HMS performance:

  • Buy an SSD for one or more Hive metastores.
  • Cloudera recommends that a single query access no more than 10,000 table partitions. If the query joins tables, calculate the combined partition count accessed across all tables.
  • Tune the backend (the RDBMS). HiveServer connects to HMS, and only HMS connects to the RDBMS. The longer the backend takes, the more memory the HMS needs to respond to the same requests. Limit the number of connections in the backend database.

    MySQL: For example, in /etc/my.cnf:

    [mysqld]
            datadir=/var/lib/mysql
            max_connections=8192
            . . .

    MariaDB: For example, in /etc/systemd/system/mariadb.service.d/limits.conf:

    [Service]
            LimitNOFILE=24000
            . . .
  • Use default thrift properties (8K):
    hive.server2.async.exec.threads 8192
    hive.server2.async.exec.wait.queue.size 8192
    hive.server2.thrift.max.worker.threads 8192
  • Set datanucleus.connectionPool.maxPoolSize for your applications. For example, if poolSize = 100, with 3 HMS instances (one dedicated to compaction), and with 4 pools per server, you can accommodate 1200 connections.