Configuring Apache HDFS Encryption

Copying Files to or from an Encryption Zone

To copy existing files to or from an encryption zone, use a tool such as distcp.

Note: for separation of administrative roles, do not use the hdfs user to create encryption zones. Instead, designate another administrative account for creating encryption keys and zones. See “Appendix: Creating an HDFS Admin User” for more information.

The files will be encrypted using a file-level key generated by the Ranger Key Management Service.

DistCp Considerations

DistCp is commonly used to replicate data between clusters for backup and disaster recovery purposes. This operation is typically performed by the cluster administrator, via an HDFS superuser account.

To retain this workflow when using HDFS encryption, a virtual path prefix, /.reserved/raw/, is provided. This virtual path gives superusers direct access to the underlying encrypted block data in the file system, so they can distcp data without requiring access to encryption keys. It also avoids the overhead of decrypting and re-encrypting data. The source and destination data will be byte-for-byte identical, which would not be true if the data were re-encrypted with a new encrypted data encryption key (EDEK).

Note

When using /.reserved/raw/ to distcp encrypted data, make sure you preserve extended attributes with the -px flag. This is necessary because encryption attributes such as the EDEK are exposed through extended attributes under /.reserved/raw/, and they must be preserved for the file to be decryptable at the destination. For example:

sudo -u encr hadoop distcp -px hdfs://cluster1-namenode:50070/.reserved/raw/apps/enczone hdfs://cluster2-namenode:50070/.reserved/raw/apps/enczone

Because the extended attributes are preserved, a distcp operation initiated at or above the encryption zone root automatically creates a new encryption zone at the destination if one does not already exist.

Recommendation: To avoid potential mishaps, first create identical encryption zones on the destination cluster.
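The recommendation above could be carried out along the following lines. This is a minimal sketch, not a definitive procedure: the key name enczone_key and the path /apps/enczone are hypothetical, and it assumes the destination KMS already holds the same key that protects the source zone. The run helper is also an illustration; it only prints each command unless RUN=1 is set, so the script can be reviewed before anything is executed.

```shell
#!/usr/bin/env bash
# Sketch: create a matching encryption zone on the destination cluster
# before running a raw distcp into it.
# The key name and zone path used here are hypothetical; the key is
# assumed to exist already in the destination KMS.

# Print a command by default; execute it only when RUN=1 is set.
run() {
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  else
    echo "$*"
  fi
}

# Usage: prepare_zone KEY_NAME ZONE_PATH
prepare_zone() {
  local key_name="$1" zone_path="$2"
  # An encryption zone can only be created on an empty directory.
  run hdfs dfs -mkdir -p "$zone_path"
  run hdfs crypto -createZone -keyName "$key_name" -path "$zone_path"
  # List the zones to confirm the new one is present.
  run hdfs crypto -listZones
}

prepare_zone "enczone_key" "/apps/enczone"
```

Run the script on the destination cluster as the administrative account designated for creating keys and zones, then start the distcp from the source cluster.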

Copying between encrypted and unencrypted locations

By default, distcp compares file system checksums to verify that data was successfully copied to the destination.

When copying between an unencrypted and encrypted location, file system checksums will not match because the underlying block data is different. In this case, specify the -skipcrccheck and -update flags to avoid verifying checksums.
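As an illustration of the flags described above, the following sketch copies an unencrypted directory into an encryption zone on the same cluster. The paths /apps/plain and /apps/enczone are hypothetical, and the run helper is an assumption of this example that prints the command unless RUN=1 is set.

```shell
#!/usr/bin/env bash
# Sketch: copy between an unencrypted directory and an encryption zone.
# Both paths are hypothetical.

# Print a command by default; execute it only when RUN=1 is set.
run() {
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  else
    echo "$*"
  fi
}

# Usage: copy_across_zones SRC DST
copy_across_zones() {
  local src="$1" dst="$2"
  # distcp accepts -skipcrccheck only together with -update.
  run hadoop distcp -update -skipcrccheck "$src" "$dst"
}

copy_across_zones "/apps/plain" "/apps/enczone"
```

Because this copy goes through the regular paths rather than /.reserved/raw/, the user running it needs access to the encryption zone key, and the data is encrypted as it is written to the destination.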