Configuring CDP Services for HDFS Encryption
This page contains recommendations for setting up HDFS Transparent Encryption with various CDP services.
HBase
Recommendations
Make /hbase an encryption zone. Do not create encryption zones as subdirectories under /hbase, because HBase may need to rename files across those subdirectories. When you create the encryption zone, name the key hbase-key to take advantage of auto-generated KMS ACLs (see Configuring KMS Access Control Lists (ACLs)).
Steps
On a cluster without HBase currently installed, create the /hbase directory and make that an encryption zone.
On a cluster with HBase already installed:
- Stop the HBase service.
- Move data from the /hbase directory to /hbase-tmp.
- Create an empty /hbase directory and make it an encryption zone.
- Distcp all data from /hbase-tmp to /hbase, preserving user-group permissions and extended attributes.
- Start the HBase service and verify that it is working as expected.
- Remove the /hbase-tmp directory.
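The migration steps for an existing installation can be sketched as shell commands. This is a minimal sketch rather than a Cloudera-supplied script. The `run` wrapper, the `DRY_RUN` variable, and the `migrate_hbase` function name are hypothetical helpers introduced here. The sketch assumes that the key is created with `hadoop key create` by a user with key administrator rights, that the HDFS commands run as an HDFS superuser, and that the HBase service is stopped and restarted separately (for example, in Cloudera Manager). The `-skipcrccheck -update` DistCp flags are included because checksums differ between the unencrypted and encrypted copies of a file.

```shell
#!/bin/sh
# Hypothetical helper: print each command unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Sketch of the /hbase migration. Stop the HBase service before calling this.
# The DistCp option -pugpx preserves user, group, permissions, and
# extended attributes.
migrate_hbase() {
  run hdfs dfs -mv /hbase /hbase-tmp &&
  run hadoop key create hbase-key &&
  run hdfs dfs -mkdir /hbase &&
  run hdfs crypto -createZone -keyName hbase-key -path /hbase &&
  run hadoop distcp -pugpx -skipcrccheck -update /hbase-tmp /hbase
}
```

Calling `migrate_hbase` prints the planned commands, and `DRY_RUN=0 migrate_hbase` executes them. Before restarting HBase, it is worth checking that the new /hbase directory has the expected owner and permissions. Once HBase has been verified, /hbase-tmp can be removed with `hdfs dfs -rm -r /hbase-tmp`.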
KMS ACL Configuration for HBase
In the KMS ACLs (see Configuring KMS Access Control Lists (ACLs)), grant the hbase user and group DECRYPT_EEK permission for the HBase key:
<property>
<name>key.acl.hbase-key.DECRYPT_EEK</name>
<value>hbase hbase</value>
</property>
Hive
HDFS encryption has been designed so that files cannot be moved from one encryption zone to another or from encryption zones to unencrypted directories. Therefore, the landing zone for data when using the LOAD DATA INPATH command must always be inside the destination encryption zone.
To use HDFS encryption with Hive, ensure you are using one of the following configurations:
Single Encryption Zone
With this configuration, you can use HDFS encryption by having all Hive data inside the same encryption zone. In Cloudera Manager, configure the Hive Scratch Directory (hive.exec.scratchdir) to be inside the encryption zone.
Recommended HDFS Path: /user/hive
To use the auto-generated KMS ACLs (see Configuring KMS Access Control Lists (ACLs)), make sure you name the encryption key hive-key.
For example, to configure a single encryption zone for the entire Hive warehouse, you can rename /user/hive to /user/hive-old, create an encryption zone at /user/hive, and then distcp all the data from /user/hive-old to /user/hive.
In Cloudera Manager, configure the Hive Scratch Directory (hive.exec.scratchdir) to be inside the encryption zone by setting it to /user/hive/tmp, ensuring that permissions are 1777 on /user/hive/tmp.
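The single-zone example above can be sketched as the following command sequence. This is an illustrative sketch under stated assumptions, not an official procedure. The `run` wrapper, the `DRY_RUN` variable, and the `encrypt_hive_warehouse` function name are hypothetical. The sketch assumes the key is created with `hadoop key create`, that the commands run with the necessary HDFS superuser and key administrator rights, and that Hive services are stopped while the data is copied. The hive.exec.scratchdir property itself is still set in Cloudera Manager.

```shell
#!/bin/sh
# Hypothetical helper: print each command unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Sketch: move the warehouse aside, create the zone, copy the data back,
# then create the scratch directory with the sticky bit set (mode 1777).
encrypt_hive_warehouse() {
  run hdfs dfs -mv /user/hive /user/hive-old &&
  run hadoop key create hive-key &&
  run hdfs dfs -mkdir /user/hive &&
  run hdfs crypto -createZone -keyName hive-key -path /user/hive &&
  run hadoop distcp -skipcrccheck -update /user/hive-old /user/hive &&
  run hdfs dfs -mkdir -p /user/hive/tmp &&
  run hdfs dfs -chmod 1777 /user/hive/tmp
}
```

By default the function only prints the plan. Running it as `DRY_RUN=0 encrypt_hive_warehouse` executes the commands.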
Multiple Encryption Zones
With this configuration, you can use encrypted databases or tables with different encryption keys. To read data from read-only encrypted tables, users must have access to a temporary directory that is encrypted at least as strongly as the table.
For example:
- Configure two encrypted tables, ezTbl1 and ezTbl2.
- Create two new encryption zones, /data/ezTbl1 and /data/ezTbl2.
- Load data to the tables in Hive using LOAD statements.
For more information, see Changed Behavior after HDFS Encryption is Enabled.
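Creating the two zones from the example might look like the sketch below. The key names eztbl1-key and eztbl2-key are hypothetical (the source does not name them) and are written in lowercase because some Hadoop key providers reject uppercase key names. The `run`, `create_zone`, and `create_table_zones` helpers are likewise illustrative. Loading the data with LOAD statements is done in Hive and is not shown.

```shell
#!/bin/sh
# Hypothetical helper: print each command unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Create a key, an empty directory, and an encryption zone on that directory.
create_zone() {
  key="$1"
  path="$2"
  run hadoop key create "$key" &&
  run hdfs dfs -mkdir -p "$path" &&
  run hdfs crypto -createZone -keyName "$key" -path "$path"
}

# One zone per table, each with its own (hypothetical) key, then list zones.
create_table_zones() {
  create_zone eztbl1-key /data/ezTbl1 &&
  create_zone eztbl2-key /data/ezTbl2 &&
  run hdfs crypto -listZones
}
```

Using a separate key for each zone is what allows the two tables to be protected independently through the KMS ACLs.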
Other Encrypted Directories
- LOCALSCRATCHDIR: The MapJoin optimization in Hive writes HDFS tables to a local directory and then uploads them to the distributed cache. To ensure these files are encrypted, either disable MapJoin by setting hive.auto.convert.join to false, or encrypt the local Hive Scratch directory (hive.exec.local.scratchdir) using Cloudera Navigator Encrypt.
- DOWNLOADED_RESOURCES_DIR: JARs that are added to a user session and stored in HDFS are downloaded to hive.downloaded.resources.dir on the HiveServer2 local filesystem. To encrypt these JAR files, configure Cloudera Navigator Encrypt to encrypt the directory specified by hive.downloaded.resources.dir.
- NodeManager Local Directory List: Hive stores JARs and MapJoin files in the distributed cache. To use MapJoin or encrypt JARs and other resource files, the yarn.nodemanager.local-dirs YARN configuration property must be configured to a set of encrypted local directories on all nodes.
Changed Behavior after HDFS Encryption is Enabled
- Loading data from one encryption zone to another results in a copy of the data. Distcp is used to speed up the process if the size of the files being copied is higher than the value specified by HIVE_EXEC_COPYFILE_MAXSIZE. The default value of HIVE_EXEC_COPYFILE_MAXSIZE is 32 MB, which you can modify by changing the value for the hive.exec.copyfile.maxsize configuration property.
- When loading data to encrypted tables, Cloudera strongly recommends using a landing zone inside the same encryption zone as the table.
  - Example 1: Loading unencrypted data to an encrypted table. Use one of the following methods:
    - If you are loading new unencrypted data to an encrypted table, use the LOAD DATA ... statement. Because the source data is not inside the encryption zone, the LOAD statement results in a copy. For this reason, Cloudera recommends landing data that you need to encrypt inside the destination encryption zone. You can use distcp to speed up the copying process if your data is inside HDFS.
    - If the data to be loaded is already inside a Hive table, you can create a new table with a LOCATION inside an encryption zone as follows:
      CREATE TABLE encrypted_table [STORED AS] LOCATION ... AS SELECT * FROM <unencrypted_table>
      The location specified in the CREATE TABLE statement must be inside an encryption zone. Creating a table pointing LOCATION to an unencrypted directory does not encrypt your source data. You must copy your data to an encryption zone, and then point LOCATION to that zone.
  - Example 2: Loading encrypted data to an encrypted table
    - If the data is already encrypted, use the CREATE TABLE statement pointing LOCATION to the encrypted source directory containing the data. This is the fastest way to create encrypted tables.
      CREATE TABLE encrypted_table [STORED AS] LOCATION ... AS SELECT * FROM <encrypted_source_directory>
- Users reading data from encrypted tables that are read-only must have access to a temporary directory that is encrypted at least as strongly as the table.
- Temporary data is now written to a directory named .hive-staging in each table or partition.
- Previously, an INSERT OVERWRITE on a partitioned table inherited permissions for new data from the existing partition directory. With encryption enabled, permissions are inherited from the table.
KMS ACL Configuration for Hive
When Hive joins tables, it compares the encryption key strength for each table. For this operation to succeed, you must configure the KMS ACLs (see Configuring KMS Access Control Lists (ACLs)) to allow the hive user and group READ access to the Hive key:
<property>
<name>key.acl.hive-key.READ</name>
<value>hive hive</value>
</property>
If you have restricted access to the GET_METADATA operation, you must grant permission for it to the hive user or group:
<property>
<name>hadoop.kms.acl.GET_METADATA</name>
<value>hive hive</value>
</property>
If you have disabled HiveServer2 impersonation (see HiveServer2 Security Configuration), you must configure the KMS ACLs to grant DECRYPT_EEK permissions to the hive user, as well as any user accessing data in the Hive warehouse.
Cloudera recommends creating a group containing all Hive users, and granting DECRYPT_EEK access to that group.
For example, suppose user jdoe (home directory /user/jdoe) is a Hive user and a member of the group hive-users. The encryption zone (EZ) key for /user/jdoe is named jdoe-key, and the EZ key for /user/hive is hive-key. The following ACL example demonstrates the required permissions:
<property>
<name>key.acl.hive-key.DECRYPT_EEK</name>
<value>hive hive-users</value>
</property>
<property>
<name>key.acl.jdoe-key.DECRYPT_EEK</name>
<value>jdoe,hive</value>
</property>
If you have enabled HiveServer2 impersonation, data is accessed by the user submitting the query or job, and the user account (jdoe in this example) may still need to access data in their home directory. In this scenario, the required permissions are as follows:
<property>
<name>key.acl.hive-key.DECRYPT_EEK</name>
<value>nobody hive-users</value>
</property>
<property>
<name>key.acl.jdoe-key.DECRYPT_EEK</name>
<value>jdoe</value>
</property>
Hue
Recommendations
Make /user/hue an encryption zone because Oozie workflows and other Hue-specific data are stored there by default. When you create the encryption zone, name the key hue-key to take advantage of auto-generated KMS ACLs (see Configuring KMS Access Control Lists (ACLs)).
Steps
On a cluster without Hue currently installed, create the /user/hue directory and make it an encryption zone.
On a cluster with Hue already installed:
- Create an empty /user/hue-tmp directory.
- Make /user/hue-tmp an encryption zone.
- DistCp all data from /user/hue into /user/hue-tmp.
- Remove /user/hue and rename /user/hue-tmp to /user/hue.
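The steps for an existing Hue installation follow a copy-then-rename pattern, which can be sketched as shown below. This is a minimal sketch under assumptions. The `run` wrapper, the `DRY_RUN` variable, and the `migrate_hue` function name are hypothetical. The sketch assumes the key is created with `hadoop key create`, that the commands run with HDFS superuser and key administrator rights, and that Hue and Oozie are not writing to /user/hue during the copy.

```shell
#!/bin/sh
# Hypothetical helper: print each command unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Sketch: build the encrypted copy next to the original, then swap it in.
migrate_hue() {
  run hadoop key create hue-key &&
  run hdfs dfs -mkdir /user/hue-tmp &&
  run hdfs crypto -createZone -keyName hue-key -path /user/hue-tmp &&
  run hadoop distcp -skipcrccheck -update /user/hue /user/hue-tmp &&
  run hdfs dfs -rm -r /user/hue &&
  run hdfs dfs -mv /user/hue-tmp /user/hue
}
```

Because the original directory stays in place until the copy has finished, a failed DistCp run leaves /user/hue untouched. Reviewing the dry-run output before running `DRY_RUN=0 migrate_hue` is recommended, since the `-rm -r` step is destructive.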
KMS ACL Configuration for Hue
In the KMS ACLs (see Configuring KMS Access Control Lists (ACLs)), grant the hue and oozie users and groups DECRYPT_EEK permission for the Hue key:
<property>
<name>key.acl.hue-key.DECRYPT_EEK</name>
<value>oozie,hue oozie,hue</value>
</property>
Impala
Recommendations
- If HDFS encryption is enabled, configure Impala to encrypt data spilled to local disk.
- In releases lower than Impala 2.2.0 / CDH 5.4.0, Impala does not support the LOAD DATA statement when the source and destination are in different encryption zones. If you are running an affected release and need to use LOAD DATA with HDFS encryption enabled, copy the data to the table's encryption zone prior to running the statement.
- Use Cloudera Navigator to lock down the local directory where Impala UDFs are copied during execution. By default, Impala copies UDFs into /tmp, and you can configure this location through the --local_library_dir startup flag for the impalad daemon.
- Limit the rename operations for internal tables once encryption zones are set up. Impala cannot do an ALTER TABLE RENAME operation to move an internal table from one database to another, if the root directories for those databases are in different encryption zones. If the encryption zone covers a table directory but not the parent directory associated with the database, Impala cannot do an ALTER TABLE RENAME operation to rename an internal table, even within the same database.
- Avoid structuring partitioned tables where different partitions reside in different encryption zones, or where any partitions reside in an encryption zone that is different from the root directory for the table. Impala cannot do an INSERT operation into any partition that is not in the same encryption zone as the root directory of the overall table.
- If the data files for a table or partition are in a different encryption zone than the HDFS trashcan, use the PURGE keyword at the end of the DROP TABLE or ALTER TABLE DROP PARTITION statement to delete the HDFS data files immediately. Otherwise, the data files are left behind if they cannot be moved to the trashcan because of differing encryption zones. This syntax is available in Impala 2.3 / CDH 5.5 and higher.
Steps
Start every impalad process with the --disk_spill_encryption=true flag set. This encrypts all spilled data using AES-256-CFB. Set this flag by selecting the Disk Spill Encryption checkbox in the Impala configuration.
KMS ACL Configuration for Impala
Cloudera recommends making the impala user a member of the hive group, and following the ACL recommendations in KMS ACL Configuration for Hive.
MapReduce and YARN
MapReduce v1
Recommendations
MRv1 stores both history and logs on local disks by default. Even if you do configure history to be stored on HDFS, the files are not renamed. Hence, no special configuration is required.
MapReduce v2 (YARN)
Recommendations
Make /user/history a single encryption zone, because history files are moved between the intermediate and done directories, and HDFS encryption does not allow moving encrypted files across encryption zones. When you create the encryption zone, name the key mapred-key to take advantage of auto-generated KMS ACLs (see Configuring KMS Access Control Lists (ACLs)).
Steps
On a cluster with MRv2 (YARN) installed, create the /user/history directory and make that an encryption zone.
If /user/history already exists and is not empty:
- Create an empty /user/history-tmp directory.
- Make /user/history-tmp an encryption zone.
- DistCp all data from /user/history into /user/history-tmp.
- Remove /user/history and rename /user/history-tmp to /user/history.
KMS ACL Configuration for MapReduce
In the KMS ACLs (see Configuring KMS Access Control Lists (ACLs)), grant DECRYPT_EEK permission for the MapReduce key to the mapred and yarn users and the hadoop group:
<property>
<name>key.acl.mapred-key.DECRYPT_EEK</name>
<value>mapred,yarn hadoop</value>
</property>
Search
Recommendations
Make /solr an encryption zone. When you create the encryption zone, name the key solr-key to take advantage of auto-generated KMS ACLs (see Configuring KMS Access Control Lists (ACLs)).
Steps
On a cluster without Solr currently installed, create the /solr directory and make that an encryption zone.
On a cluster with Solr already installed:
- Create an empty /solr-tmp directory.
- Make /solr-tmp an encryption zone.
- DistCp all data from /solr into /solr-tmp.
- Remove /solr, and rename /solr-tmp to /solr.
KMS ACL Configuration for Search
In the KMS ACLs (see Configuring KMS Access Control Lists (ACLs)), grant the solr user and group DECRYPT_EEK permission for the Solr key:
<property>
<name>key.acl.solr-key.DECRYPT_EEK</name>
<value>solr solr</value>
</property>
Spark
Recommendations
- By default, application event logs are stored at /user/spark/applicationHistory, which can be made into an encryption zone.
- Spark also optionally caches its JAR file at /user/spark/share/lib (by default), but encrypting this directory is not required.
- Spark does not encrypt shuffle data. To protect shuffle data, configure the Spark local directory, spark.local.dir (in Standalone mode), to reside on an encrypted disk. For YARN mode, make the corresponding YARN configuration changes.
KMS ACL Configuration for Spark
In the KMS ACLs (see Configuring KMS Access Control Lists (ACLs)), grant DECRYPT_EEK permission for the Spark key to the spark user and any groups that can submit Spark jobs:
<property>
<name>key.acl.spark-key.DECRYPT_EEK</name>
<value>spark spark-users</value>
</property>
Sqoop
Recommendations
- For Hive support: Ensure that you are using Sqoop with the --target-dir parameter set to a directory that is inside the Hive encryption zone. For more details, see Hive.
- For append/incremental support: Make sure that the sqoop.test.import.rootDir property points to the same encryption zone as the --target-dir argument.
- For HCatalog support: No special configuration is required.