Setting up Data Cache for Remote Reads

In CDP Public Cloud, the Impala remote data cache is enabled by default in Cloudera Data Warehouse and in most Data Hub templates that include Impala, such as Data Mart. In Data Hub, however, you can enable or disable the data cache as needed through Cloudera Manager.

When Impala compute nodes and their storage are not co-located, the network bandwidth requirement goes up because network traffic includes the data fetches as well as the shuffle exchange traffic of intermediate results. To reduce the pressure on the network, you can enable the compute nodes to cache the working set read from remote filesystems, such as remote HDFS DataNodes, S3, ABFS, and ADLS.

To enable the remote data cache for Data Hub clusters using Cloudera Manager:

  1. In Cloudera Manager, navigate to Clusters > Impala Service.
  2. In the Configuration tab, select Enable Local Data Cache to turn on the local Impala Daemon data cache for remote reads.
  3. In Impala Daemon Data Cache Directories, add the directories that the Impala Daemon will use to cache remote read data.
  4. In Impala Daemon Data Cache Per Directory Capacity, specify the maximum amount of local disk space that each Impala Daemon can use in each of the configured directories to cache remote read data.
  5. Click Save Changes and restart the Impala service.
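
Because the capacity in step 4 applies to each directory, the total local disk space one daemon can use for the cache is the per-directory capacity multiplied by the number of directories from step 3. The Python sketch below illustrates that arithmetic. It also formats the equivalent value for the upstream Apache Impala `--data_cache` startup flag, which takes the form `dir1,dir2,...:quota`. It is an assumption here that the Cloudera Manager properties correspond to this flag. The function names and the `/data/0` and `/data/1` paths are hypothetical and used only for illustration.

```python
def total_cache_bytes(directories, per_dir_capacity_bytes):
    """Return the maximum cache size for one daemon, in bytes.

    The per-directory capacity applies to every configured directory,
    so the total is capacity multiplied by the number of directories.
    """
    if not directories:
        raise ValueError("at least one cache directory is required")
    if per_dir_capacity_bytes <= 0:
        raise ValueError("per-directory capacity must be positive")
    return len(directories) * per_dir_capacity_bytes


def data_cache_flag(directories, quota):
    """Format a value for the impalad --data_cache startup flag.

    Assumption: the Cloudera Manager settings map to this flag, whose
    upstream form is dir1,dir2,...:quota (quota applies per directory).
    """
    if not directories:
        raise ValueError("at least one cache directory is required")
    return "--data_cache=" + ",".join(directories) + ":" + quota


GIB = 1024 ** 3

# Hypothetical example: two cache directories with 500 GiB each.
dirs = ["/data/0", "/data/1"]
total = total_cache_bytes(dirs, 500 * GIB)
print(total // GIB, "GiB per daemon")
print(data_cache_flag(dirs, "500GB"))
```

With two directories and 500 GB each, a single daemon could use up to roughly 1 TB of local disk for cached remote data, so size the underlying volumes accordingly.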