Configuring CDH Services for HDFS Encryption
Hive
HDFS encryption has been designed such that files cannot be moved from one encryption zone to another encryption zone or from encryption zones to unencrypted directories. Hence, the landing zone for data when using the LOAD DATA INPATH command should always be inside the destination encryption zone.
If you want to use HDFS encryption with Hive, ensure you are using one of the following configurations:
Single Encryption Zone
With this configuration, you can use HDFS encryption by having all Hive data inside the same encryption zone.
Recommended HDFS Path: /user/hive
For example, to configure a single encryption zone for the entire Hive warehouse, you can rename /user/hive to /user/hive-old, create an encryption zone at /user/hive, and then distcp all the data from /user/hive-old to /user/hive.
Additionally, in Cloudera Manager, configure the Hive Scratch Directory (hive.exec.scratchdir) to be inside the encryption zone by setting it to /user/hive/tmp, ensuring the permissions are 1777 on /user/hive/tmp.
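The single-zone migration described above can be sketched as a dry-run plan. This is a minimal sketch, not a Cloudera-provided tool: the key name hive-key and the helper plan_hive_single_zone are hypothetical, and the code only builds command lines for review rather than running them. It assumes the key is created with hadoop key create, the zone with hdfs crypto -createZone, and that distcp is given -skipcrccheck -update because checksums of unencrypted and encrypted copies do not match.

```python
import shlex


def plan_hive_single_zone(key_name="hive-key", warehouse="/user/hive"):
    """Build (but do not run) the commands that move Hive data into one zone.

    key_name is a hypothetical key name; warehouse is the path the text recommends.
    """
    old = warehouse + "-old"
    scratch = warehouse + "/tmp"
    return [
        # Create the encryption key in the KMS (assumed to be run by a key administrator).
        ["hadoop", "key", "create", key_name],
        # Move the existing data aside: a zone can only be created on an empty directory.
        ["hdfs", "dfs", "-mv", warehouse, old],
        ["hdfs", "dfs", "-mkdir", warehouse],
        # Turn the empty directory into an encryption zone (assumed HDFS superuser).
        ["hdfs", "crypto", "-createZone", "-keyName", key_name, "-path", warehouse],
        # Copy the data back; it is encrypted as it is written into the zone.
        ["hadoop", "distcp", "-skipcrccheck", "-update", old, warehouse],
        # Scratch directory inside the zone with permissions 1777.
        ["hdfs", "dfs", "-mkdir", "-p", scratch],
        ["hdfs", "dfs", "-chmod", "1777", scratch],
    ]


if __name__ == "__main__":
    for argv in plan_hive_single_zone():
        print(shlex.join(argv))
```

The plan does not restore the owner or mode of the new /user/hive directory and does not remove /user/hive-old; check both by hand once the copied data has been verified.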
Multiple Encryption Zones
With this configuration, you can use encrypted databases or tables with different encryption keys. The only limitation is that in order to read data from read-only encrypted tables, users must have access to a temporary directory which is encrypted with at least as strong encryption as the table.
For example, you can configure two encrypted tables, ezTbl1 and ezTbl2. Create two new encryption zones, /data/ezTbl1 and /data/ezTbl2. Load data to the tables in Hive as usual using LOAD statements. See the Changed Behavior after HDFS Encryption is Enabled section below for more information.
Other Encrypted Directories
- LOCALSCRATCHDIR: The MapJoin optimization in Hive writes HDFS tables out to a local directory and then uploads them to the distributed cache. If you want to enable encryption, you will either need to disable MapJoin or encrypt the local Hive Scratch directory (hive.exec.local.scratchdir).
- DOWNLOADED_RESOURCES_DIR: Jars which are added to a user session and stored in HDFS are downloaded to hive.downloaded.resources.dir. If you want these Jar files to be encrypted, configure hive.downloaded.resources.dir to be part of an encryption zone. This directory is local to the HiveServer2.
- NodeManager Local Directory List: Since Hive stores Jars and MapJoin files in the distributed cache, if you'd like to use MapJoin or encrypt Jars and other resource files, the YARN configuration property, NodeManager Local Directory List (yarn.nodemanager.local-dirs), must be configured to a set of encrypted local directories on all nodes.
Alternatively, you can disable MapJoin by setting hive.auto.convert.join to false.
Changed Behavior after HDFS Encryption is Enabled
- Loading data from one encryption zone to another results in a copy of the data. Distcp is used to speed up the process if the size of the files being copied exceeds the value of HIVE_EXEC_COPYFILE_MAXSIZE. This threshold defaults to 32 MB and can be changed through the hive.exec.copyfile.maxsize configuration property.
- When loading data to encrypted tables, Cloudera strongly recommends using a landing zone inside the same encryption zone as the table.
- Example 1: Loading unencrypted data to an encrypted table - There are 2 approaches to doing this.
- If you're loading new unencrypted data to an encrypted table, just load the data using the LOAD DATA ... statement. Since the source data does not reside inside the encryption zone, the LOAD statement will result in a copy. This is why Cloudera recommends landing data (that you expect to encrypt) inside the destination encryption zone. However, this approach may use distcp to speed up the copying process if your data is inside HDFS.
- If the data to be loaded is already inside a Hive table, you can create a new table with a LOCATION inside an encryption zone as follows:
CREATE TABLE encrypted_table [STORED AS] LOCATION ... AS SELECT * FROM <unencrypted_table>
Note that the location specified in the CREATE TABLE statement above needs to be inside an encryption zone. Creating a table pointing LOCATION to an unencrypted directory will not encrypt your source data. You must copy your data to an encryption zone, and then point LOCATION to that encryption zone.
- Example 2: Loading encrypted data to an encrypted table - If the data to be loaded is already encrypted, use the CREATE TABLE statement pointing LOCATION to the encrypted source directory where your data is. This is the fastest way to create encrypted tables.
CREATE TABLE encrypted_table [STORED AS] LOCATION <encrypted_source_directory>
- Users reading data from read-only encrypted tables must have access to a temporary directory that is encrypted at least as strongly as the table.
- Temporary data is now written to a directory named .hive-staging within each table or partition.
- Previously, an INSERT OVERWRITE on a partitioned table inherited permissions for new data from the existing partition directory. With encryption enabled, permissions are inherited from the table.
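The load behavior described in the list above (a move within one zone, a copy across zones, distcp for large copies) can be illustrated with a short sketch. It is not Hive's actual implementation: zone_of and load_behavior are hypothetical helper names, the zone list is supplied by hand, and the 32 MB threshold is the figure quoted in the text for hive.exec.copyfile.maxsize.

```python
from posixpath import normpath

COPYFILE_MAXSIZE = 32 * 1024 * 1024  # 32 MB, the figure quoted in the text


def zone_of(path, zones):
    """Return the zone root that contains path, or None if path is unencrypted."""
    path = normpath(path)
    inside = [z for z in zones if path == z or path.startswith(z.rstrip("/") + "/")]
    # Pick the most specific root if more than one matches.
    return max(inside, key=len) if inside else None


def load_behavior(src, table_dir, zones, size_bytes, max_size=COPYFILE_MAXSIZE):
    """Describe how a LOAD DATA INPATH would transfer files, per the rules above."""
    if zone_of(src, zones) == zone_of(table_dir, zones):
        return "move"  # same zone (or both unencrypted): a rename is possible
    # Different zones: the data must be copied, with distcp above the threshold.
    return "distcp" if size_bytes > max_size else "copy"
```

For example, with zones /data/ezTbl1 and /data/ezTbl2, loading a 1 KB file from an unencrypted /landing directory into a table under /data/ezTbl1 is classified as a copy, while loading from a landing directory under /data/ezTbl1 is classified as a move. This is why a landing zone inside the table's encryption zone is recommended.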
Impala
Recommendations
- If HDFS encryption is enabled, configure Impala to encrypt data spilled to local disk.
- Impala does not support the LOAD DATA statement when the source and destination are in different encryption zones. If you need to use LOAD DATA, copy the data to the table's encryption zone prior to running the statement.
- Use Cloudera Navigator to lock down the local directory where Impala UDFs are copied during execution. By default, Impala copies UDFs into /tmp, and you can configure this location through the --local_library_dir startup flag for the impalad daemon.
- Limit the rename operations for internal tables once encryption zones are set up. Impala cannot do an ALTER TABLE RENAME operation to move an internal table from one database to another, if the root directories for those databases are in different encryption zones. If the encryption zone covers a table directory but not the parent directory associated with the database, Impala cannot do an ALTER TABLE RENAME operation to rename an internal table even within the same database.
- Avoid structuring partitioned tables where different partitions reside in different encryption zones, or where any partitions reside in an encryption zone that is different from the root directory for the table. Impala cannot do an INSERT operation into any partition that is not in the same encryption zone as the root directory of the overall table.
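The partition layout rule above can be checked ahead of time with a small path comparison. This is only an illustrative sketch: partitions_outside_table_zone is a hypothetical helper, and the zone roots and partition directories are assumed to be collected separately (for example from hdfs crypto -listZones and the table metadata).

```python
from posixpath import normpath


def zone_of(path, zones):
    """Return the zone root that contains path, or None if path is unencrypted."""
    path = normpath(path)
    inside = [z for z in zones if path == z or path.startswith(z.rstrip("/") + "/")]
    return max(inside, key=len) if inside else None


def partitions_outside_table_zone(table_root, partition_dirs, zones):
    """List partition directories whose zone differs from the table root's zone."""
    root_zone = zone_of(table_root, zones)
    return [p for p in partition_dirs if zone_of(p, zones) != root_zone]
```

Any directory returned by this check is a partition that, per the recommendation above, Impala would not be able to INSERT into.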
HBase
Recommendations
Make /hbase an encryption zone. Do not create encryption zones as subdirectories under /hbase, as HBase may need to rename files across those subdirectories.
Steps
On a cluster without HBase currently installed, create the /hbase directory and make that an encryption zone.
On a cluster with HBase already installed, perform the following steps:
- Stop the HBase service.
- Move data from the /hbase directory to /hbase-tmp.
- Create an empty /hbase directory and make it an encryption zone.
- Distcp all data from /hbase-tmp to /hbase preserving user-group permissions and extended attributes.
- Start the HBase service and verify that it is working as expected.
- Remove the /hbase-tmp directory.
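The steps above can be written down as a checklist that pairs each step with the command it likely corresponds to. This is a hedged sketch: plan_hbase_migration and the key name hbase-key are hypothetical, the key is assumed to exist in the KMS already, and nothing is executed. The distcp flags are an assumption based on the DistCp options -p (with u, g, p, x for user, group, permission, and extended attributes) and -skipcrccheck -update, which avoid checksum comparison between unencrypted and encrypted data.

```python
def plan_hbase_migration(key_name="hbase-key", root="/hbase"):
    """Return (description, argv) pairs; argv is None for manual steps."""
    tmp = root + "-tmp"
    return [
        ("Stop the HBase service", None),
        ("Move existing data aside", ["hdfs", "dfs", "-mv", root, tmp]),
        ("Create an empty root directory", ["hdfs", "dfs", "-mkdir", root]),
        ("Make it an encryption zone",
         ["hdfs", "crypto", "-createZone", "-keyName", key_name, "-path", root]),
        ("Copy data back, preserving user, group, permissions, and xattrs",
         ["hadoop", "distcp", "-pugpx", "-skipcrccheck", "-update", tmp, root]),
        ("Start the HBase service and verify it", None),
        ("Remove the temporary copy", ["hdfs", "dfs", "-rm", "-r", tmp]),
    ]


if __name__ == "__main__":
    for number, (description, argv) in enumerate(plan_hbase_migration(), start=1):
        command = " ".join(argv) if argv else "(manual step)"
        print(f"{number}. {description}: {command}")
```

The sketch does not set the owner or mode of the new /hbase directory; confirm both before starting HBase, and remove /hbase-tmp only after the service has been verified.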
Search
Steps
On a cluster without Solr currently installed, create the /solr directory and make that an encryption zone. On a cluster with Solr already installed, create an empty /solr-tmp directory, make /solr-tmp an encryption zone, distcp all data from /solr into /solr-tmp, remove /solr and rename /solr-tmp to /solr.
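For a cluster with Solr already installed, the procedure above builds the encryption zone next to the live directory and then swaps it in. A minimal sketch of that sequence follows; plan_zone_swap, as_script, and the key name solr-key are hypothetical, the key is assumed to exist already, and the output is a script to review rather than something the code runs.

```python
import shlex


def plan_zone_swap(path, key_name):
    """Build the commands that replace path with an encrypted copy of itself."""
    tmp = path + "-tmp"
    return [
        # Create the empty temporary directory and make it an encryption zone.
        ["hdfs", "dfs", "-mkdir", tmp],
        ["hdfs", "crypto", "-createZone", "-keyName", key_name, "-path", tmp],
        # Copy the data in; checksums differ once encrypted, so skip the CRC check.
        ["hadoop", "distcp", "-skipcrccheck", "-update", path, tmp],
        # Remove the unencrypted original and rename the zone into its place.
        ["hdfs", "dfs", "-rm", "-r", path],
        ["hdfs", "dfs", "-mv", tmp, path],
    ]


def as_script(plan):
    """Render the plan as shell-quoted lines, one command per line."""
    return "\n".join(shlex.join(argv) for argv in plan)


if __name__ == "__main__":
    print(as_script(plan_zone_swap("/solr", "solr-key")))
```

Because the original directory is deleted in the fourth command, verify the copied data in /solr-tmp before running the last two lines.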
Sqoop
Recommendations
- For Hive support: Ensure that you are using Sqoop with the --target-dir parameter set to a directory that is inside the Hive encryption zone. For more details, see the Hive section.
- For append/incremental support: Make sure that the sqoop.test.import.rootDir property points to the same encryption zone as the above --target-dir argument.
- For HCatalog support: No special configuration should be required.
Hue
Recommendations
Make /user/hue an encryption zone since that's where Oozie workflows and other Hue specific data are stored by default.
Steps
On a cluster without Hue currently installed, create the /user/hue directory and make that an encryption zone. On a cluster with Hue already installed, create an empty /user/hue-tmp directory, make /user/hue-tmp an encryption zone, distcp all data from /user/hue into /user/hue-tmp, remove /user/hue and rename /user/hue-tmp to /user/hue.
Spark
Recommendations
- By default, application event logs are stored at /user/spark/applicationHistory which can be made into an encryption zone.
- Spark also optionally caches its jar file at /user/spark/share/lib (by default), but encrypting this directory is not necessary.
- Spark does not encrypt shuffle data. However, if that is desired, you should configure Spark's local directory, spark.local.dir (in Standalone mode), to reside on an encrypted disk. For YARN mode, make the corresponding YARN configuration changes.
MapReduce and YARN
MapReduce v1
MapReduce v2 (YARN)
Recommendations
Make /user/history a single encryption zone, since history files are moved between the intermediate and done directories, and HDFS encryption does not allow moving encrypted files across encryption zones.
Steps
On a cluster with MRv2 (YARN) installed, create the /user/history directory and make that an encryption zone. If /user/history already exists and is not empty, create an empty /user/history-tmp directory, make /user/history-tmp an encryption zone, distcp all data from /user/history into /user/history-tmp, remove /user/history and rename /user/history-tmp to /user/history.