Adding Files to an Encryption Zone
You can add files to an encryption zone by copying them into the zone with distcp:
sudo -u hdfs hadoop distcp /user/dir /encryption_zone
DistCp Considerations
A common use case for DistCp is to replicate data between clusters for backup and disaster
recovery purposes. This is typically performed by the cluster administrator, who is an HDFS
superuser. To retain this workflow when using HDFS encryption, the virtual path prefix
/.reserved/raw/ was introduced. It gives superusers direct access to the underlying
encrypted block data in the filesystem. This allows superusers to distcp data without
requiring access to encryption keys, and avoids the overhead of decrypting and
re-encrypting data. It also means the source and destination data will be byte-for-byte
identical, which would not be the case if the data were re-encrypted with a new EDEK.
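As a rough illustration of the raw-path workflow described above, the following sketch builds a cluster-to-cluster distcp command that reads and writes under /.reserved/raw/. The namenode URIs, the zone path, and the helper name raw_distcp_cmd are hypothetical placeholders, not values from this guide. The -px option (preserve extended attributes) is included on the assumption that each file's encryption metadata must travel with the raw data. The script only prints the command so it can be reviewed before an HDFS superuser runs it.

```shell
#!/bin/sh
# Sketch only: the URIs and path below are hypothetical placeholders.
SRC_NN="hdfs://cluster1-namenode:8020"
DST_NN="hdfs://cluster2-namenode:8020"
ZONE="/apps/enczone"

# Hypothetical helper: print a distcp command that copies raw (still encrypted) data.
# $1 = source namenode URI, $2 = destination namenode URI, $3 = zone path
raw_distcp_cmd() {
    # Both source and destination use the /.reserved/raw prefix;
    # -px asks distcp to preserve extended attributes.
    echo "hadoop distcp -px $1/.reserved/raw$3 $2/.reserved/raw$3"
}

# Dry run: print the command for review instead of executing it.
raw_distcp_cmd "$SRC_NN" "$DST_NN" "$ZONE"
```

Printing the command first is a deliberate choice here: a raw copy bypasses decryption entirely, so it is worth confirming both paths before running it against production clusters.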
Copying Data from Encrypted Locations
By default, distcp compares checksums provided by the filesystem to verify that data was
successfully copied to the destination. When copying from an encrypted location, the
filesystem checksums will not match because the underlying block data is different. This
is true whether the destination location is encrypted or unencrypted.
In this case, you can specify the -skipcrccheck and -update flags to avoid verifying
checksums. When you use -skipcrccheck, distcp checks file integrity by comparing file
sizes immediately after each file is copied.
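A minimal sketch of such an invocation follows. The source and destination paths and the helper name skipcrc_distcp_cmd are hypothetical, chosen only for illustration. As with any distcp run against encrypted data, the user running the command needs access to the relevant encryption keys unless the raw path prefix is used. The script prints the command rather than executing it.

```shell
#!/bin/sh
# Sketch only: both paths are hypothetical placeholders.
SRC="/encryption_zone/dir"
DST="/user/backup/dir"

# Hypothetical helper: print a distcp command that skips checksum verification.
# $1 = source path, $2 = destination path
skipcrc_distcp_cmd() {
    # -update copies files that are missing or differ at the destination;
    # -skipcrccheck disables the checksum comparison between source and target.
    echo "hadoop distcp -update -skipcrccheck $1 $2"
}

# Dry run: print the command for review instead of executing it.
skipcrc_distcp_cmd "$SRC" "$DST"
```

Because only file sizes are compared in this mode, a separate spot check of a few copied files may be worthwhile when integrity matters.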