Setting Up HDFS Caching
Set up HDFS caching with Impala for improved performance.
Decide how much memory to devote to the HDFS cache on each host. The total memory available for cached data is the sum of the cache sizes on all the hosts. By default, any data block is only cached on one host although you can cache a block across multiple hosts by increasing the replication factor.
Enable or disable HDFS caching through Cloudera Manager, using the
configuration setting Maximum Memory Used for
Caching for the HDFS service. This control sets the HDFS configuration parameter
dfs_datanode_max_locked_memory, which specifies the upper limit of HDFS cache size on each node. Set up the HDFS caching for your Hadoop cluster.
- All the other manipulation of the HDFS caching settings, such as what files are cached, is done through the command line, either Impala DDL statements or the Linux hdfs cacheadmin command.
- Using the hdfs cacheadmin command, set up
one or more pools owned by the same user as the
impalad daemon (typically
hdfs cacheadmin -addPool four_gig_pool -owner impala -limit 4000000000
- Once HDFS caching is enabled and one or more pools are
available, on the Impala side, you specify the cache pool name defined
hdfs cacheadmincommand in the Impala DDL statements that enable HDFS caching for a table or partition, such as
CREATE TABLE ... CACHED IN poolor
ALTER TABLE ... SET CACHED IN pool.
- You can use hdfs cacheadmin -listDirectives to get a list of existing cache pools.
- You can use hdfs cacheadmin -listDirectives -stats to get detailed information about the pools.