Configuring Encryption for Data Spills

Some CDH services can encrypt data stored temporarily on the local filesystem outside HDFS. For example, data may spill to disk during memory-intensive operations, or when a service exceeds its allotted memory on a host. You can enable on-disk spill encryption for the following services.

MapReduce v2 (YARN)

MapReduce v2 can encrypt intermediate files generated during encrypted shuffle and data spilled to disk during the map and reduce stages. To enable encryption on these intermediate files, use Cloudera Manager to specify settings for the MapReduce Client Advanced Configuration Snippet (Safety Valve) for the mapred-site.xml associated with a gateway node.

From the Cloudera Manager Admin Console:
  1. Select Clusters > YARN.
  2. Click the Configuration tab.
  3. In the Search field, enter MapReduce Client Advanced Configuration Snippet (Safety Valve) to find the safety valve on one of YARN's gateway nodes.
  4. Enter the XML in the field (or switch to Editor mode and enter each property and its value in the fields provided). A complete example of the XML is shown here:
    <property>
        <name>mapreduce.job.encrypted-intermediate-data</name>
        <value>true</value>
    </property>
    <property>
        <name>mapreduce.job.encrypted-intermediate-data-key-size-bits</name>
        <value>128</value>
    </property>
    <property>
        <name>mapreduce.job.encrypted-intermediate-data.buffer.kb</name>
        <value>128</value>
    </property>
    
  5. Click Save Changes.

The following table describes the properties used for data-spill encryption:

Property                                                  Default  Description
mapreduce.job.encrypted-intermediate-data                 false    Enables (true) or disables (false) encryption of intermediate MapReduce spills.
mapreduce.job.encrypted-intermediate-data-key-size-bits   128      Length of the encryption key, in bits.
mapreduce.job.encrypted-intermediate-data.buffer.kb       128      Buffer size, in KB, for the stream written to disk after encryption.
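After the client configuration is deployed, one way to confirm the settings is to read them back from the gateway's mapred-site.xml. The following Python sketch is illustrative only and is not part of Cloudera Manager: the helper name spill_encryption_settings and the sample XML are assumptions, and it assumes the file uses the standard Hadoop <configuration> root element. Properties missing from the file fall back to the defaults listed in the table.

```python
import xml.etree.ElementTree as ET

# Defaults as listed in the property table.
DEFAULTS = {
    "mapreduce.job.encrypted-intermediate-data": "false",
    "mapreduce.job.encrypted-intermediate-data-key-size-bits": "128",
    "mapreduce.job.encrypted-intermediate-data.buffer.kb": "128",
}


def spill_encryption_settings(xml_text):
    """Return the effective spill-encryption settings from a mapred-site.xml string.

    Hypothetical helper: any property not present in the XML keeps its default.
    """
    root = ET.fromstring(xml_text)
    settings = dict(DEFAULTS)
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        if name in settings:
            settings[name] = (prop.findtext("value") or "").strip()
    return settings


# Illustrative input; a real check would read the file from the
# client configuration directory on the gateway host.
SAMPLE = """
<configuration>
    <property>
        <name>mapreduce.job.encrypted-intermediate-data</name>
        <value>true</value>
    </property>
</configuration>
"""

if __name__ == "__main__":
    for name, value in spill_encryption_settings(SAMPLE).items():
        print(f"{name} = {value}")
```

With the sample input, encryption is reported as enabled while the key size and buffer size keep their default of 128.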

HBase

HBase does not write data outside HDFS and therefore does not require spill encryption.

Impala

Impala lets certain memory-intensive operations write temporary data to disk when they approach their memory limit on a host. For details, see SQL Operations that Spill to Disk. To enable disk spill encryption in Impala:

  1. In the Cloudera Manager Admin Console, go to the Impala service.
  2. Click the Configuration tab.
  3. Select Scope > Impala Daemon.
  4. Select Category > Security.
  5. Select the checkbox for the Disk Spill Encryption property.
  6. Click Save Changes.

Hive

Hive jobs occasionally write data temporarily to local directories. If you enable HDFS encryption, you must ensure that the following intermediate local directories are also protected:

  • LOCALSCRATCHDIR: The MapJoin optimization in Hive writes HDFS tables to a local directory and then uploads them to the distributed cache. To ensure these files are encrypted, either disable MapJoin by setting hive.auto.convert.join to false, or encrypt the local Hive Scratch directory (hive.exec.local.scratchdir) using Cloudera Navigator Encrypt.
  • DOWNLOADED_RESOURCES_DIR: JARs that are added to a user session and stored in HDFS are downloaded to hive.downloaded.resources.dir on the HiveServer2 local filesystem. To encrypt these JAR files, configure Cloudera Navigator Encrypt to encrypt the directory specified by hive.downloaded.resources.dir.
  • NodeManager Local Directory List: Hive stores JARs and MapJoin files in the distributed cache. To use MapJoin or encrypt JARs and other resource files, the yarn.nodemanager.local-dirs YARN configuration property must point to a set of encrypted local directories on every node.

For more information on Hive behavior with HDFS encryption enabled, see Using HDFS Encryption with Hive.
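The directory requirements above can be checked mechanically once you know which mount points are encrypted. The following Python sketch is a rough illustration under several assumptions: the helper names is_under and unprotected_dirs are hypothetical, the property values are already resolved to absolute paths (no ${...} variables), the list of encrypted mount points is supplied by you, and hive.auto.convert.join is treated as true when it is not set.

```python
import os


def is_under(path, mount):
    """Return True if path equals mount or lies beneath it (absolute paths)."""
    path = os.path.normpath(path)
    mount = os.path.normpath(mount)
    return os.path.commonpath([path, mount]) == mount


def unprotected_dirs(conf, encrypted_mounts):
    """List (property, directory) pairs that fall outside every encrypted mount.

    conf is a dict of property name to value, for example merged from
    hive-site.xml and yarn-site.xml (hypothetical input format).
    """
    props = ["hive.downloaded.resources.dir", "yarn.nodemanager.local-dirs"]
    # The local scratch directory only needs checking while MapJoin is enabled.
    if conf.get("hive.auto.convert.join", "true").lower() != "false":
        props.insert(0, "hive.exec.local.scratchdir")

    findings = []
    for prop in props:
        # yarn.nodemanager.local-dirs is a comma-separated list of directories.
        for directory in conf.get(prop, "").split(","):
            directory = directory.strip()
            if not directory:
                continue
            if not any(is_under(directory, m) for m in encrypted_mounts):
                findings.append((prop, directory))
    return findings


if __name__ == "__main__":
    # Example values are made up for illustration.
    conf = {
        "hive.auto.convert.join": "true",
        "hive.exec.local.scratchdir": "/tmp/hive",
        "hive.downloaded.resources.dir": "/data/encrypted/hive/resources",
        "yarn.nodemanager.local-dirs": "/data/encrypted/yarn/nm,/data/2/yarn/nm",
    }
    for prop, directory in unprotected_dirs(conf, ["/data/encrypted"]):
        print(f"{prop}: {directory} is not on an encrypted mount")
```

The check only compares path prefixes. It does not verify that a mount point is actually protected by Cloudera Navigator Encrypt, so treat its result as a configuration review aid rather than proof of encryption.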

Flume

Flume supports on-disk encryption for log files written by Flume file channels. See Configuring Encrypted On-disk File Channels for Flume.