Apache Hive Performance Tuning
Also available as:
PDF

Query result cache and metastore cache

You need to understand the query result cache to enable or disable it for debugging and to configure the memory allocated for the results. You also need to understand how the metadata cache affects data consistency.

Using the query result cache

Some operations support hundreds of thousands of users who connect to Hive using BI systems such as Tableau. In these situations, repetitious queries are inevitable. The query result cache, which is on by default, filters and stores common queries in a cache. When you issue the query again, Hive retrieves the query result from a cache instead of recomputing the result, which takes a load off the backend system.

Every query that runs in Hive 3 stores its result in a cache. Hive evicts invalid data from the cache if the input table changes. For example, if you preform aggregation and the base table changes, queries you run most frequently stay in cache, but stale queries are evicted. The query result cache works with managed tables only because Hive cannot track changes to an external table. If you join external and managed tables, Hive falls back to executing the full query. The query result cache works with ACID tables. If you update an ACID table, Hive reruns the query automatically.

You can enable and disable the query result cache from command line. You might want to do so to debug a query. You disable the query result cache by setting the following parameter to false: hive.query.results.cache.enabled=false

Hive stores the query result cache in /tmp/hive/__resultcache__/. By default, Hive allocates 2GB for the query result cache. You can change this setting by configuring the following parameter in bytes: hive.query.results.cache.max.size

Using the metadata cache

When query execution time is less than 1 second, compilation time dominates. Metadata retrieval is often a significant part of compilation time, which is devoted to RDBMS queries. In this situation, Cloud RDBMS As a Service is often slower, and frequent queries leads to throttling. Metadata caching speeds compilation time by approximately 50 percent in an on-premises MySQL installation. Significantly more improvements occur with cloud RDBMS. Caching achieves consistency in a single metastore setup and eventual consistent in an HA setup.