Set up HDFS caching with Impala for improved performance.
- Decide how much memory to devote to the HDFS cache on each host. The total memory available for cached data is the sum of the cache sizes on all the hosts. By default, each data block is cached on only one host, although you can cache a block across multiple hosts by increasing the replication factor.
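For instance, at the HDFS level a cache directive can request cached replicas on several hosts at once. This is an illustrative sketch only; the path and pool name here are hypothetical:
hdfs cacheadmin -addDirective -path /user/impala/census -pool pool1 -replication 3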
- Enable or disable HDFS caching through Cloudera Manager, using the configuration setting Maximum Memory Used for Caching for the HDFS service. This control sets the HDFS configuration parameter dfs.datanode.max.locked.memory, which specifies the upper limit of the HDFS cache size on each node.
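As a quick check (a sketch, assuming shell access to a cluster node with the HDFS client configured), you can read back the effective limit, which is reported in bytes:
hdfs getconf -confKey dfs.datanode.max.locked.memory
This value must also fit within the operating system's locked-memory (memlock) ulimit for the DataNode process; otherwise the DataNode fails to start.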
- All other manipulation of the HDFS caching settings, such as which files are cached, is done through the command line, either Impala DDL statements or the Linux hdfs cacheadmin command.
- Using the hdfs cacheadmin command, set up one or more pools owned by the same user as the impalad daemon (typically impala). The -limit value is in bytes, so the following example creates a pool of roughly 4 GB:
hdfs cacheadmin -addPool four_gig_pool -owner impala -limit 4000000000
- Once HDFS caching is enabled and one or more pools are available, on the Impala side you specify the cache pool name defined by the hdfs cacheadmin command in the Impala DDL statements that enable HDFS caching for a table or partition, such as CREATE TABLE ... CACHED IN pool or ALTER TABLE ... SET CACHED IN pool; see the example after this list.
- You can use hdfs cacheadmin -listPools to get a list of existing cache pools.
- You can use hdfs cacheadmin -listPools -stats to get detailed information about the pools.
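For example, here is a minimal sketch of the Impala DDL side, assuming the four_gig_pool created above and a hypothetical table named census; note that Impala expects the pool name as a quoted string:
-- Cache a new table's data files in the pool.
CREATE TABLE census (name STRING) CACHED IN 'four_gig_pool';
-- Cache an existing table, with cached replicas on 3 hosts.
ALTER TABLE census SET CACHED IN 'four_gig_pool' WITH REPLICATION = 3;
-- Stop caching the table's data files.
ALTER TABLE census SET UNCACHED;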