Setting Up HDFS Caching

Set up HDFS caching with Impala for improved performance.

Decide how much memory to devote to the HDFS cache on each host. The total memory available for cached data is the sum of the cache sizes on all the hosts. By default, each data block is cached on only one host, although you can cache a block across multiple hosts by increasing the replication factor.
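As a rough sizing sketch (the host count and per-host figure below are hypothetical), the aggregate cache is simply the per-host cache size multiplied by the number of hosts:

    # Hypothetical sizing sketch: 20 DataNodes, each with 4 GB devoted to the HDFS cache.
    HOSTS=20
    BYTES_PER_HOST=$((4 * 1024 * 1024 * 1024))
    echo "aggregate HDFS cache: $((HOSTS * BYTES_PER_HOST)) bytes"   # 85899345920, i.e. 80 GiB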

  1. Enable or disable HDFS caching through Cloudera Manager, using the configuration setting Maximum Memory Used for Caching for the HDFS service.
    This control sets the HDFS configuration parameter dfs.datanode.max.locked.memory, which specifies the upper limit of the HDFS cache size on each node.
    • All other manipulation of the HDFS caching settings, such as which files are cached, is done through the command line, using either Impala DDL statements or the Linux hdfs cacheadmin command.
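    You can sanity-check the effective limit from a DataNode host; a minimal sketch, assuming shell access to that host (the value shown is hypothetical):
    # Print the configured per-DataNode cache limit, in bytes.
    hdfs getconf -confKey dfs.datanode.max.locked.memory   # e.g. 4294967296
    # The OS locked-memory ulimit (reported in KB on most Linux systems) must be
    # at least as large, or the DataNode cannot lock cache pages in memory.
    ulimit -l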
  2. Using the hdfs cacheadmin command, set up one or more pools owned by the same user as the impalad daemon (typically impala).
    For example:
    hdfs cacheadmin -addPool four_gig_pool -owner impala -limit 4000000000
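    The same tool can adjust or remove a pool later; for reference (reusing the pool name from the example above):
    # Raise the pool's byte limit to 8 GB, then (if it is no longer needed) remove it.
    hdfs cacheadmin -modifyPool four_gig_pool -limit 8000000000
    hdfs cacheadmin -removePool four_gig_pool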
    
  3. Once HDFS caching is enabled and one or more pools are available, specify the cache pool name defined by the hdfs cacheadmin command in the Impala DDL statements that enable HDFS caching for a table or partition, such as CREATE TABLE ... CACHED IN pool or ALTER TABLE ... SET CACHED IN pool.
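    For example (the table definition is hypothetical; the pool is the one created in step 2):
    # Cache a table's data blocks at creation time.
    impala-shell -q "CREATE TABLE census (name STRING, zip STRING) CACHED IN 'four_gig_pool';"
    # Cache an existing table, keeping each cached block on 3 hosts rather than the default 1.
    impala-shell -q "ALTER TABLE census SET CACHED IN 'four_gig_pool' WITH REPLICATION = 3;"
    # Stop caching the table's data.
    impala-shell -q "ALTER TABLE census SET UNCACHED;"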
  4. You can use hdfs cacheadmin -listPools to get a list of existing cache pools.
  5. You can use hdfs cacheadmin -listPools -stats to get detailed information about the pools, or hdfs cacheadmin -listDirectives -stats for details about the individual cache directives.
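    For example (reusing the pool name from step 2):
    # Summarize the pool, including its limit and how much data is cached.
    hdfs cacheadmin -listPools -stats four_gig_pool
    # Show the individual cache directives in that pool, with per-directive statistics.
    hdfs cacheadmin -listDirectives -stats -pool four_gig_pool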