Validating Hadoop Key Operations
Use hadoop key create to create a test key, and then use hadoop key list to retrieve the key list:

sudo -u <key_admin> hadoop key create keytrustee_test
hadoop key list
Interacting with the KMS and creating encryption zones requires the use of two new CLI commands: hadoop key and hdfs crypto. The following sections will help you get started with creating encryption keys and setting up encryption zones.

Before continuing, make sure that your KMS ACLs have been set up according to best practices. For more information, see Configuring KMS Access Control Lists (ACLs).
Once a KMS has been set up and the NameNode and HDFS clients have been correctly configured, use the hadoop key and hdfs crypto command-line tools to create encryption keys and set up new encryption zones.
Create an encryption key for your zone as keyadmin for the user/group (regardless of the application that will be using the encryption zone):

sudo -u hdfs hadoop key create <key_name>
Create a new empty directory and make it an encryption zone using the key created above:

sudo -u hdfs hadoop fs -mkdir /encryption_zone
sudo -u hdfs hdfs crypto -createZone -keyName <key_name> -path /encryption_zone

You can verify creation of the new encryption zone by running the -listZones command. You should see the encryption zone along with its key listed as follows:

$ sudo -u hdfs hdfs crypto -listZones
/encryption_zone <key_name>
For more information and recommendations on creating encryption zones for each CDP component, see Configuring CDP Services for HDFS Encryption.
You can add existing data to an encryption zone by copying it into the zone using tools such as distcp. For example:

sudo -u hdfs hadoop distcp /user/dir /encryption_zone
A common use case for DistCp is to replicate data between clusters for backup and disaster recovery purposes. This is typically performed by the cluster administrator, who is an HDFS superuser. To retain this workflow when using HDFS encryption, a new virtual path prefix has been introduced, /.reserved/raw/, that gives superusers direct access to the underlying block data in the filesystem. This allows superusers to distcp data without requiring access to encryption keys, and avoids the overhead of decrypting and re-encrypting data. It also means the source and destination data will be byte-for-byte identical, which would not be true if the data were re-encrypted with a new EDEK.
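As a minimal sketch (the cluster hostnames and zone path here are illustrative, not taken from this document), a superuser copy of raw encrypted data might look like the following; the -px flag preserves extended attributes, which is required when copying through /.reserved/raw/:

sudo -u hdfs hadoop distcp -px \
    hdfs://source-namenode:8020/.reserved/raw/encryption_zone \
    hdfs://backup-namenode:8020/.reserved/raw/encryption_zone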
By default, distcp compares checksums provided by the filesystem to verify that data was successfully copied to the destination. When copying from an encrypted location, the filesystem checksums will not match because the underlying block data is different. This is true whether the destination location is encrypted or unencrypted. In this case, you can specify the -skipcrccheck and -update flags to avoid verifying checksums. When you use -skipcrccheck, distcp checks file integrity by comparing file sizes immediately after the copy of each file completes.
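For example, a backup copy out of an encryption zone to another cluster might be run as follows (a sketch only; the destination cluster hostname and backup path are illustrative):

sudo -u hdfs hadoop distcp -update -skipcrccheck /encryption_zone \
    hdfs://backup-namenode:8020/encryption_zone_backup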
To delete an encryption zone, remove the encrypted directory:

sudo -u hdfs hadoop fs -rm -r -skipTrash /encryption_zone
If you are using the Java KeyStore KMS, make sure you regularly back up the Java KeyStore
that stores the encryption keys. If you are using the Key Trustee KMS and Key Trustee
Server, see Backing up Key Trustee Server and Clients
for instructions on backing up
Key Trustee Server and Key Trustee KMS.
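As a minimal sketch, assuming the KMS is backed by a JavaKeyStoreProvider whose keystore file is at /var/lib/kms/kms.keystore (the actual location is whatever hadoop.kms.key.provider.uri points to in kms-site.xml), a backup can be as simple as copying that file to a safe location:

# Path is an assumption; check hadoop.kms.key.provider.uri for the real keystore location
cp /var/lib/kms/kms.keystore /backup/kms.keystore.$(date +%Y%m%d)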
Before attempting to roll an encryption key (also known as an encryption zone key, or EZ key), familiarize yourself with the concepts described in HDFS Transparent Encryption, as the material in these sections presumes you are familiar with the fundamentals of HDFS transparent encryption and Cloudera data at rest encryption.
When you roll an EZ key, you are essentially creating a new version of the key
(ezKeyVersionName
). Rolling EZ keys regularly helps enterprises minimize
the risk of key exposure. If a malicious attacker were to obtain the EZ key and decrypt
encrypted data encryption keys (EDEKs) into DEKs, they could gain the ability to decrypt
HDFS files. Rolling an EZ key ensures that all DEKs for newly-created files will be
encrypted with the new version of the EZ key. The older EZ key version that the attacker
obtained cannot decrypt these EDEKs. You may want to roll the encryption key periodically,
as part of your security policy or when an external security compromise is detected.
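For example, you can review how many versions each key has accumulated using the key shell's metadata listing (a sketch; run it as a user with permission to read key metadata from the KMS):

sudo -u <key_admin> hadoop key list -metadata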
$ hdfs crypto -listZones
/ez    key1
/ez2   key2
/user  key1

The first column identifies the encryption zone paths; the second column identifies the encryption key name.
View the encryption information for a file in the zone by using the hdfs crypto -getFileEncryptionInfo command. Note the EZ key version name and value, which you can use for comparison and verification after rolling the EZ key.
$ hdfs crypto -getFileEncryptionInfo -path /ez/f
{cipherSuite: {name: AES/CTR/NoPadding, algorithmBlockSize: 16}, cryptoProtocolVersion: CryptoProtocolVersion{description='Encryption zones', version=2, unknownValue=null}, edek: 373c0c2e919c27e58c1c343f54233cbd, iv: d129c913c8a34cde6371ec95edfb7337, keyName: key1, ezKeyVersionName: 7mbvopZ0Weuvs0XtTkpGw3G92KuWc4e4xcTXl0bXCpF}

Log off as HDFS Superuser.
Roll the EZ key, where <key name> is key1:

hadoop key roll key1
This operation
contacts the KMS and rolls the keys there. Note that this can take a considerable
amount of time, depending on the number of key versions residing in the KMS.
Rolling key version from KeyProvider: org.apache.hadoop.crypto.key.kms.LoadBalancingKMSClientProvider@5ea434c8 for keyName: key1
key1 has been successfully rolled.
org.apache.hadoop.crypto.key.kms.LoadBalancingKMSClientProvider@5ea434c8 has been updated.
Note that a file created in the encryption zone after the roll references the new EZ key version (compare its ezKeyVersionName with the one noted earlier):

$ hdfs crypto -getFileEncryptionInfo -path /ez/new_file
{cipherSuite: {name: AES/CTR/NoPadding, algorithmBlockSize: 16}, cryptoProtocolVersion: CryptoProtocolVersion{description='Encryption zones', version=2, unknownValue=null}, edek: 9aa13ea4a700f96287cfe1349f6ff4f2, iv: 465c878ad9325e42fa460d2a22d12a72, keyName: key1, ezKeyVersionName: 4tuvorJ6Feeqk8WiCfdDs9K32KuEj7g2ydCAv0gNQbY}

Alternatively, you can use the KMS REST API to view key metadata and key versions. Elements appearing in brackets should be replaced with your actual values. In this case, before rolling a key, you can view the key metadata and versions as follows:
$ curl -k --negotiate -u: "https://<KMS_HOSTNAME>:16000/kms/v1/key/<key-name>/_metadata"
{
  "name" : "<key-name>",
  "cipher" : "<cipher>",
  "length" : <length>,
  "description" : "<description>",
  "created" : <millis-epoc>,
  "versions" : <versions> (For example, 1)
}

$ curl -k --negotiate -u: "https://<KMS_HOSTNAME>:16000/kms/v1/key/<key-name>/_currentversion"
{
  "material" : "<material>",
  "name" : "<key-name>",
  "versionName" : "<versionName>" (For example, version 1)
}
$ hadoop key roll key1
Rolling key version from KeyProvider: KMSClientProvider[https://<KMS_HOSTNAME>:16000/kms/v1/] for key name: <key-name>
key1 has been successfully rolled.
KMSClientProvider[https://<KMS_HOSTNAME>/kms/v1/] has been updated.

$ curl -k --negotiate -u: "https://<KMS_HOSTNAME>:16000/kms/v1/key/<key-name>/_currentversion"
{
  "material" : "<material>", (New material)
  "name" : "<key-name>",
  "versionName" : "<versionName>" (New version name. For example, version 2)
}

$ curl -k --negotiate -u: "https://<KMS_HOSTNAME>:16000/kms/v1/key/<key-name>/_metadata"
{
  "name" : "<key-name>",
  "cipher" : "<cipher>",
  "length" : <length>,
  "description" : "<description>",
  "created" : <millis-epoc>,
  "versions" : <versions> (For example, version 2)
}
Before attempting to re-encrypt an EDEK, familiarize yourself with the concepts described
in HDFS Transparent Encryption
, as the material in this section presumes you are
familiar with the fundamentals of HDFS transparent encryption and Cloudera data at rest
encryption.
When you re-encrypt an EDEK, you are essentially decrypting the original EDEK to recover the DEK, and then re-encrypting the DEK using the new (rolled) version of the EZ key (see Rolling Encryption Keys). The file's metadata, which is stored in the NameNode, is then updated with this new EDEK. Re-encryption does not impact the data in the HDFS files or the DEK itself; the same DEK is still used to decrypt the file, so re-encryption is essentially transparent.
In addition to minimizing security risks, re-encrypting the EDEK offers the following capabilities and benefits:
If you run the re-encryption command without successfully verifying the preceding items, the command fails with errors.
This section identifies limitations associated with the re-encryption of EDEKs.
EDEK re-encryption does not change EDEKs on snapshots, due to the immutable nature of HDFS snapshots. Therefore, be aware that after an EZ key has been exposed, the Key Administrator must delete the snapshots.
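For example (a sketch with illustrative names, assuming a snapshot named snap1 was taken on the zone /ez), the snapshot would be removed as follows:

hdfs dfs -deleteSnapshot /ez snap1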
This scenario operates on the assumption that an encryption zone has already been set up for this cluster.
Review the usage options of the hdfs crypto command:

$ hdfs crypto
    [-createZone -keyName <keyName> -path <path>]
    [-listZones]
    [-provisionTrash -path <path>]
    [-getFileEncryptionInfo -path <path>]
    [-reencryptZone <action> -path <zone>]
    [-listReencryptionStatus]
    [-help <command-name>]
List the encryption zone to be re-encrypted:

$ hdfs crypto -listZones
/ez key1

The first column identifies the encryption zone path (/ez); the second column identifies the encryption key name (key1).

Roll the EZ key, where <key name> is key1:

hadoop key roll key1
This operation
contacts the KMS and rolls the keys. Note that this can take a considerable amount of
time, depending on the number of key versions.
Rolling key version from KeyProvider: org.apache.hadoop.crypto.key.kms.LoadBalancingKMSClientProvider@5ea434c8 for keyName: key1
key1 has been successfully rolled.
org.apache.hadoop.crypto.key.kms.LoadBalancingKMSClientProvider@5ea434c8 has been updated.
Use getFileEncryptionInfo to record the file's current EZ key version name (ezKeyVersionName) and EDEK, so that you can confirm they have changed after re-encryption:

$ hdfs crypto -getFileEncryptionInfo -path /ez/f
{cipherSuite: {name: AES/CTR/NoPadding, algorithmBlockSize: 16}, cryptoProtocolVersion: CryptoProtocolVersion{description='Encryption zones', version=2, unknownValue=null}, edek: 9aa13ea4a700f96287cfe1349f6ff4f2, iv: d129c913c8a34cde6371ec95edfb7337, keyName: key1, ezKeyVersionName: 7mbvopZ0Weuvs0XtTkpGw3G92KuWc4e4xcTXl0bXCpF}
Re-encrypt the encryption zone:

hdfs crypto -reencryptZone -start -path /ez

The following information appears when the submission is complete. At this point, the NameNode is processing and re-encrypting all of the EDEKs under the /ez directory.

re-encrypt command successfully submitted for zone: /ez action: START

Depending on the number of files, the re-encryption operation can take a long time. Re-encrypting one million EDEKs typically takes two to six minutes, depending on the NameNode hardware. To check the status of the re-encryption for the zone:
hdfs crypto -listReencryptionStatus
Column Name | Description | Sample Data |
---|---|---|
ZoneName | The encryption zone name. | /ez |
Status | The current status of the re-encryption operation for the zone. | Completed |
EZKey Version Name | The encryption zone key version name, which is used for re-encryption comparison. After re-encryption is complete, all files in the encryption zone are guaranteed to have an EDEK whose encryption zone key version is at least equal to this version. | ZMHfRoGKeXXgf0QzCX8q16NczIw2sq0rWRTOHS3YjCz |
Submission Time | The time at which the re-encryption operation commenced. | 2017-09-07 10:01:09,262-0700 |
Is Canceled? | True: the re-encryption operation has been canceled. False: the re-encryption operation has not been canceled. | False |
Completion Time | The time at which the re-encryption operation completed. | 2017-09-07 10:01:10,441-0700 |
Number of files re-encrypted | Reflects only the files whose EDEKs have been updated. If a file is created after the key is rolled, it already has an EDEK encrypted by the new key version, so the re-encryption operation skips that file. In other words, a "Completed" re-encryption can report fewer re-encrypted files than the number of files actually in the encryption zone. Note: If you re-encrypt an EZ key that has already been re-encrypted and there are no new files, the number of files re-encrypted is 0. | 1 |
Number of failures | When 0, no errors occurred during the re-encryption operation. If larger than 0, investigate the NameNode log and re-encrypt. | 0 |
Last file Checkpointed | Identifies the current position of the re-encryption process in the encryption zone; in other words, the file that was most recently re-encrypted. | 0 |
After the re-encryption completes, view the file encryption information again and confirm that the EZ key version name and EDEK have changed:

$ hdfs crypto -getFileEncryptionInfo -path /ez/f
{cipherSuite: {name: AES/CTR/NoPadding, algorithmBlockSize: 16}, cryptoProtocolVersion: CryptoProtocolVersion{description='Encryption zones', version=2, unknownValue=null}, edek: 373c0c2e919c27e58c1c343f54233cbd, iv: d129c913c8a34cde6371ec95edfb7337, keyName: key1, ezKeyVersionName: ZMHfRoGKeXXgf0QzCX8q16NczIw2sq0rWRTOHS3YjCz}
This section includes information that can help you manage various facets of the EDEK re-encryption process.
Only users with the HDFS Superuser privilege can cancel the EDEK re-encryption after the operation has started.
To cancel a re-encryption:
hdfs crypto -reencryptZone -cancel -path <zone>
While it is not recommended, it is possible to roll the encryption zone key version on the KMS while a re-encryption of that encryption zone is already in progress in the NameNode. The re-encryption is guaranteed to complete with all DEKs re-encrypted to a key version equal to or later than the encryption zone key version at the time the re-encryption command was submitted. For example, if the key version is rolled from v0 to v1 and a re-encryption command is then submitted, and the key is later rolled again on the KMS to v2, all EDEKs are guaranteed to be re-encrypted to at least v1. To ensure that all EDEKs are re-encrypted to v2, submit another re-encryption command for the encryption zone.
Rolling keys during re-encryption is not recommended because of the potential negative impact on key management operations. Due to the asynchronous nature of re-encryption, there is no guarantee of exactly when the rolled encryption keys take effect. Re-encryption can only guarantee that all EDEKs are re-encrypted to at least the EZ key version that existed when the re-encryption command was issued.
With the default settings, you will not typically need to throttle re-encryption operations. However, if re-encrypting a large number of files causes excessive performance impact, advanced users can throttle the operation so that the impact on the HDFS NameNode and Key Trustee KMS is minimized.
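As a starting point, you can inspect the configured values of the HDFS re-encryption properties with hdfs getconf (the property names below are the Apache HDFS re-encryption throttling settings; verify that they exist unchanged in your release before relying on them):

hdfs getconf -confKey dfs.namenode.reencrypt.batch.size
hdfs getconf -confKey dfs.namenode.reencrypt.throttle.limit.handler.ratio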
You can monitor the HDFS NameNode heap and CPU usage from Cloudera Manager.