Replication of Encrypted Data
HDFS supports encryption of data at rest (including data accessed through Hive). This topic describes how replication works within and between encryption zones and how to configure replication to avoid failures due to encryption.
Continue reading:
Encrypting Data in Transit Between Clusters
A source directory and destination directory may or may not be in an encryption zone. If the destination directory is in an encryption zone, the data on the destination directory is encrypted. If the destination directory is not in an encryption zone, the data on that directory is not encrypted, even if the source directory is in an encryption zone. For more information about HDFS encryption zones, see HDFS Transparent Encryption. Encryption zones are not supported in CDH versions 5.1 or lower. Replication of data from CDH 6 encrypted zones to CDH 5 is not supported.
When you configure encryption zones, you also configure a Key Management Server (KMS) to manage encryption keys. During replication, Cloudera Manager uses TLS/SSL to encrypt the keys when they are transferred from the source cluster to the destination cluster.
When you configure encryption zones, you also configure a Key Management Server (KMS) to manage encryption keys. When a HDFS replication command that specifies an encrypted source directory runs, Cloudera Manager temporarily copies the encryption keys from the source cluster to the destination cluster, using TLS/SSL (if configured for the KMS) to encrypt the keys. Cloudera Manager then uses these keys to decrypt the encrypted files when they are received from the source cluster before writing the files to the destination cluster.
During replication, data travels from the source cluster to the destination cluster using distcp. For clusters that use encryption zones, configure encryption of KMS key transfers between the source and destination using TLS/SSL. See Configuring TLS/SSL for the KMS.
- Enable TLS/SSL for HDFS clients on both the source and the destination clusters. For instructions, see Configuring TLS/SSL for HDFS, YARN and MapReduce. You may also need to configure trust between the SSL certificates on the source and destination.
- Enable TLS/SSL for the two peer Cloudera Manager Servers. See Configuring TLS Encryption for Cloudera Manager.
- Encrypt data transfer using HDFS Data Transfer Encryption. See Configuring Encrypted Transport for HDFS.
The following blog post provides additional information about encryption with HDFS: http://blog.cloudera.com/blog/2013/03/how-to-set-up-a-hadoop-cluster-with-network-encryption/.
Security Considerations
The user you specify with the Run As field when scheduling a replication job requires full access to both the key and the data directories being replicated. This is not a recommended best practice for KMS management. If you change permissions in the KMS to enable this requirement, you could accidentally provide access for this user to data in other encryption zones using the same key. If a user is not specified in the Run As field, the replication runs as the default user, hdfs.
To access encrypted data, the user must be authorized on the KMS for the encryption zones they need to interact with. The user you specify with the Run As field when scheduling a replication must have this authorization. The key administrator must add ACLs to the KMS for that user to prevent authorization failure.
Key transfer using the KMS protocol from source to the client uses the REST protocol, which requires that you configure TLS/SSL for the KMS. When TLS/SSL is enabled, keys are not transferred over the network as plain text.