Copying Data with DistCp
You can use DistCp to copy data between your cluster's HDFS filesystem and your cloud storage. DistCp is a utility for copying large data sets between distributed filesystems. To access the DistCp utility, SSH to any node in your cluster.
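For example, assuming a worker node reachable at the hypothetical address worker1.example.com, you can log in and run DistCp with no arguments, which should print its usage summary and confirm the utility is on the path:

ssh user@worker1.example.com
hadoop distcp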
Copying Data from HDFS to Cloud Storage
To transfer data from HDFS to an object store, list the HDFS source path first and the cloud storage destination path second:
hadoop distcp hdfs://source-folder s3a://destination-bucket
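If the cluster is not already configured with credentials for the destination store, you can pass them on the command line. The following is a sketch using the standard S3A credential properties; the key values shown are placeholders:

hadoop distcp -D fs.s3a.access.key=YOUR_ACCESS_KEY -D fs.s3a.secret.key=YOUR_SECRET_KEY hdfs://source-folder s3a://destination-bucket

Because command-line values can appear in process listings and shell history, storing the credentials in core-site.xml or a Hadoop credential provider is generally preferable.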
Updating Existing Data
If you would like to transfer only the files that don't already exist in the target folder, add the -update option to improve the copy speed:
hadoop distcp -update -skipcrccheck -numListstatusThreads 40 hdfs://source-folder s3a://destination-bucket
When copying between cloud object stores and HDFS, the -update check compares only file size; it does not use checksums to detect other changes in the data.
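After the copy completes, you can spot-check the destination with the Hadoop filesystem shell. For example, reusing the paths from the command above, list the copied files and compare aggregate sizes:

hadoop fs -ls s3a://destination-bucket/
hadoop fs -du -s hdfs://source-folder s3a://destination-bucket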
Copying Data from Cloud Storage to HDFS
To copy data from your cloud storage container to HDFS, list the cloud storage source path first and the HDFS destination path second. For example:
hadoop distcp adl://alice.azuredatalakestore.net/datasets /tmp/datasets
This downloads all files under the given path in the ADLS store to /tmp/datasets in the cluster filesystem.
You can add the -update option to download only data which has changed:
hadoop distcp -update -skipcrccheck adl://alice.azuredatalakestore.net/datasets /tmp/datasets
As when uploading from HDFS to cloud storage, file checksums are not verified, so an updated file whose new length is the same as its previous length will not be downloaded.
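If you need files re-copied regardless of the size comparison, one option is to replace -update with -overwrite, which re-copies every file from the source; this sacrifices the incremental behavior, so it is only worthwhile when same-length updates are a real concern:

hadoop distcp -overwrite adl://alice.azuredatalakestore.net/datasets /tmp/datasets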
Copying Data Between Cloud Storage Containers
You can copy data between cloud storage containers simply by listing the different URLs as the source and destination paths. This includes copying:
Between two Amazon S3 buckets
Between two ADLS containers
Between two WASB containers
Between ADLS and WASB containers
For example, to copy data from an Amazon S3 bucket to an ADLS store, you could use the following command:
hadoop distcp -numListstatusThreads 40 s3a://hwdev-example-ireland/datasets adl://alice.azuredatalakestore.net/datasets
Irrespective of source and destination store locations, when copying data with DistCp, all data passes through the Hadoop cluster: once to read, once to write. This means that the time to perform the copy depends on the size of the Hadoop cluster, and the bandwidth between it and the object stores. The operations may also incur costs for the data downloaded.
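If a copy competes with other workloads for cluster or network capacity, you can tune it. As a sketch, the -m option limits the number of map tasks used for the copy, and the -bandwidth option caps the bandwidth, in MB per second, consumed by each map:

hadoop distcp -m 20 -bandwidth 50 s3a://hwdev-example-ireland/datasets adl://alice.azuredatalakestore.net/datasets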
Copying Data Within a Cloud Storage Container
Copy operations within a single object store still take place in the Hadoop cluster, even when the object store implements a more efficient copy operation internally. That is, an operation such as
hadoop distcp -numListstatusThreads 40 s3a://bucket/datasets/set1 s3a://bucket/datasets/set2
copies each byte down to the Hadoop worker nodes and back up to the bucket. In addition to being slow, the operation may incur charges.
Note: When using DistCp to copy data in S3, even within the same bucket, the current encryption settings of the client are used, irrespective of the encryption settings of the source data.
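If the copied objects must end up with a particular server-side encryption setting, you can set it explicitly for the copy. As a sketch, assuming SSE-S3 (AES256) is wanted on the destination objects:

hadoop distcp -D fs.s3a.server-side-encryption-algorithm=AES256 s3a://bucket/datasets/set1 s3a://bucket/datasets/set2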