Managing Encryption Keys and Zones
Interacting with the KMS and creating encryption zones requires the use of two new CLI commands: hadoop key and hdfs crypto. The following sections will help you get started with creating encryption keys and setting up encryption zones.
Before continuing, make sure that your KMS ACLs have been set up according to best practices. For more information, see Configuring KMS Access Control Lists.
Validating Hadoop Key Operations
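Use hadoop key create to create a test key, and then use hadoop key list to confirm that the key appears in the key list:
$ sudo -u <key_admin> hadoop key create keytrustee_test
$ hadoop key list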
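The validation step can also be scripted. The following is a minimal sketch, not a tool shipped with Hadoop: the helper name key_listed is hypothetical, and it assumes that hadoop key list prints each key name on its own line after a header line.

```shell
#!/bin/sh
# key_listed NAME: read `hadoop key list` output on stdin and succeed only
# if NAME appears as a complete line. Hypothetical helper; assumes one key
# name per line with no surrounding whitespace.
key_listed() {
    grep -qxF -- "$1"
}

# Example (requires a working KMS, so it is left commented out):
#   hadoop key list | key_listed keytrustee_test && echo "key found"
```

Matching the whole line with -x avoids a false positive when one key name is a prefix of another.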
Creating Encryption Zones
Once a KMS has been set up and the NameNode and HDFS clients have been correctly configured, use the hadoop key and hdfs crypto command-line tools to create encryption keys and set up new encryption zones.
- Create an encryption key for your zone as the keyadmin for the user/group (regardless of the application that will be using the encryption zone):
$ sudo -u hdfs hadoop key create <key_name>
- Create a new empty directory and make it an encryption zone using the key created above.
$ sudo -u hdfs hadoop fs -mkdir /encryption_zone
$ sudo -u hdfs hdfs crypto -createZone -keyName <key_name> -path /encryption_zone
You can verify creation of the new encryption zone by running the -listZones command. You should see the encryption zone along with its key listed as follows:
$ sudo -u hdfs hdfs crypto -listZones
/encryption_zone <key_name>
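The steps above can be combined into one small script. This is a sketch under stated assumptions rather than a recommended procedure: the names run, create_zone, and DRY_RUN are hypothetical, and the script defaults to printing the commands so that they can be reviewed before anything is changed on the cluster.

```shell
#!/bin/sh
# Sketch: create a key, an empty directory, and an encryption zone.
# run, create_zone, and DRY_RUN are hypothetical names, not part of Hadoop.

# Print the command when DRY_RUN=1 (the default here); execute it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# create_zone KEY_NAME ZONE_PATH
# Stops at the first command that fails.
create_zone() {
    key_name=$1
    zone_path=$2
    run sudo -u hdfs hadoop key create "$key_name" &&
    run sudo -u hdfs hadoop fs -mkdir "$zone_path" &&
    run sudo -u hdfs hdfs crypto -createZone -keyName "$key_name" -path "$zone_path" &&
    run sudo -u hdfs hdfs crypto -listZones
}
```

Calling create_zone mykey /encryption_zone prints the four commands in order. Setting DRY_RUN=0 in the environment executes them instead. The key name mykey is only an example.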
For more information and recommendations on creating encryption zones for each CDH component, see Configuring CDH Services for HDFS Encryption.
Adding Files to an Encryption Zone
Existing data can be encrypted by copying it into the new encryption zone using tools like DistCp.
$ sudo -u hdfs hadoop distcp /user/dir /encryption_zone
- For more information on KMS setup and high availability configuration, see Configuring the Key Management Server (KMS).
- For instructions on securing the KMS using Kerberos, TLS/SSL communication and ACLs, see Securing the Key Management Server (KMS).
- If you want to use the KMS to encrypt data used by other CDH services, see Configuring CDH Services for HDFS Encryption for information on recommended encryption zones for each service.
DistCp Considerations
A common use case for DistCp is to replicate data between clusters for backup and disaster recovery purposes. This is typically performed by the cluster administrator, who is an HDFS superuser. To retain this workflow when using HDFS encryption, a new virtual path prefix has been introduced, /.reserved/raw/, that gives superusers direct access to the underlying block data in the filesystem. This allows superusers to distcp data without requiring access to encryption keys, and avoids the overhead of decrypting and re-encrypting data. It also means the source and destination data will be byte-for-byte identical, which would not be true if the data were re-encrypted with a new EDEK.
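As an illustration, a superuser copy through the raw namespace might be assembled as follows. This is a sketch: the helper raw_path, the cluster addresses, and the zone path are placeholders, and only the /.reserved/raw prefix and the distcp -px option come from HDFS itself. The -px option preserves extended attributes, which is where the per-file encryption metadata is exposed under /.reserved/raw.

```shell
#!/bin/sh
# raw_path URI_PREFIX ABSOLUTE_PATH: prefix an HDFS path with /.reserved/raw.
# Hypothetical helper for building the source and destination arguments.
raw_path() {
    printf '%s/.reserved/raw%s\n' "$1" "$2"
}

# Placeholder cluster addresses and zone path; replace with your own.
SRC=$(raw_path hdfs://cluster1-nn:8020 /encryption_zone)
DST=$(raw_path hdfs://cluster2-nn:8020 /encryption_zone)

# Print the command for review instead of running it.
echo "sudo -u hdfs hadoop distcp -px $SRC $DST"
```

Both the source and the destination use the raw prefix, so the encrypted bytes and their metadata are copied unchanged.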
Copying data from encrypted locations
By default, distcp compares checksums provided by the filesystem to verify that data was successfully copied to the destination. When copying from an encrypted location, the filesystem checksums will not match because the underlying block data is different. This is true whether the destination location is encrypted or unencrypted.
In this case, you can specify the -skipcrccheck and -update flags to avoid verifying checksums. When you use -skipcrccheck, distcp checks the file integrity by performing a file size comparison, right after the copy completes for each file.
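For example, a copy that skips checksum verification could be built as shown below. This is a sketch: copy_cmd is a hypothetical helper that only prints the command, and the paths are the example paths used earlier in this section.

```shell
#!/bin/sh
# copy_cmd SRC DST: print a distcp command that skips the checksum comparison.
# Hypothetical helper; -update and -skipcrccheck are the distcp flags
# described above.
copy_cmd() {
    echo "sudo -u hdfs hadoop distcp -update -skipcrccheck $1 $2"
}

# Print the command for the example paths so it can be reviewed first.
copy_cmd /user/dir /encryption_zone
```

Note that with -update, distcp copies the contents of the source directory into the destination rather than creating the source directory beneath it, so check the resulting layout after the copy.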
Deleting Encryption Zones
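To remove an encryption zone, delete the encrypted directory. This deletes the directory and all of its contents and bypasses the trash, so make sure the data is no longer needed.
$ sudo -u hdfs hadoop fs -rm -r -skipTrash /encryption_zone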
Backing Up Encryption Keys
If you are using the Java KeyStore KMS, make sure you regularly back up the Java KeyStore that stores the encryption keys. If you are using the Key Trustee KMS and Key Trustee Server, see Backing Up and Restoring Key Trustee Server and Clients for instructions on backing up Key Trustee Server and Key Trustee KMS.