Apache Hive Performance TuningPDF version

Query results cache

Hive filters and caches similar or identical queries in the query result cache. Caching repetitive queries can reduce the load substantially when hundreds or thousands of users of BI tools and web services query Hive.

Some operations support hundreds of thousands of users who connect to Hive using BI systems such as Tableau. In these situations, repetitious queries are inevitable. The query result cache, which is on by default, filters and stores common queries in a cache. When you issue the query again, Hive retrieves the query result from a cache instead of recomputing the result, which takes a load off the backend system.

Every query that runs in Hive 3 stores its result in a cache. Hive evicts invalid data from the cache if the input table changes. For example, if you preform aggregation and the base table changes, queries you run most frequently stay in cache, but stale queries are evicted. The query result cache works with managed tables only because Hive cannot track changes to an external table. If you join external and managed tables, Hive falls back to executing the full query. The query result cache works with ACID tables. If you update an ACID table, Hive reruns the query automatically.

You can enable and disable the query results cache from command line. You might want to do so to debug a query. You disable the query results cache in HiveServer by setting the following parameter to false: SET hive.query.results.cache.enabled=false;

By default, Hive allocates 2GB for the query results cache. You can change this setting by configuring the following parameter in bytes: hive.query.results.cache.max.size