Copying Data with DistCp
You can use DistCp to copy data between your cluster's HDFS filesystem and your cloud storage. DistCp is a utility for copying large data sets between distributed filesystems. To access the DistCp utility, SSH to any node in your cluster.
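For example, assuming a worker node reachable at the hypothetical address worker1.example.com, you can log in and run DistCp with no arguments, which should print its usage summary and confirm the utility is on the path:

ssh user@worker1.example.com
hadoop distcp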
Copying Data from HDFS to Cloud Storage
To transfer data from HDFS to an object store, list the HDFS source path first and the cloud storage destination path second:
hadoop distcp hdfs://source-folder s3a://destination-bucket
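If the cluster is not already configured with credentials for the destination store, you can pass them on the command line. The following is a sketch using the standard S3A credential properties; the key values shown are placeholders:

hadoop distcp -D fs.s3a.access.key=YOUR_ACCESS_KEY -D fs.s3a.secret.key=YOUR_SECRET_KEY hdfs://source-folder s3a://destination-bucket

Because command-line values can appear in process listings and shell history, storing the credentials in core-site.xml or a Hadoop credential provider is generally preferable.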
Updating Existing Data
If you would like to transfer only the files that don't already exist in the target folder, add the -update option to improve the copy speed:
hadoop distcp -update -skipcrccheck -numListstatusThreads 40 hdfs://source-folder s3a://destination-bucket
When copying between cloud object stores and HDFS, the -update check compares only file size; it does not use checksums to detect other changes in the data.
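After the copy completes, you can spot-check the destination with the Hadoop filesystem shell. For example, reusing the paths from the command above, list the copied files and compare aggregate sizes:

hadoop fs -ls s3a://destination-bucket/
hadoop fs -du -s hdfs://source-folder s3a://destination-bucket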
Copying Data from Cloud Storage to HDFS
To copy data from your cloud storage container to HDFS, list the cloud storage source path first and the HDFS destination path second. For example:
hadoop distcp adl://alice.azuredatalakestore.net/datasets /tmp/datasets
This downloads all files under the given path in the ADLS store to /tmp/datasets in the cluster filesystem.
You can add the -update option to download only data which has changed:
hadoop distcp -update -skipcrccheck adl://alice.azuredatalakestore.net/datasets /tmp/datasets
As when uploading from HDFS to cloud storage, file checksums are not verified, so an updated file whose new length is the same as its previous length will not be downloaded.
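If you need files re-copied regardless of the size comparison, one option is to replace -update with -overwrite, which re-copies every file from the source; this sacrifices the incremental behavior, so it is only worthwhile when same-length updates are a real concern:

hadoop distcp -overwrite adl://alice.azuredatalakestore.net/datasets /tmp/datasets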
Copying Data Between Cloud Storage Containers
You can copy data between cloud storage containers simply by listing the different URLs as the source and destination paths. This includes copying:
Between two Amazon S3 buckets
Between two ADLS containers
Between two WASB containers
Between ADLS and WASB containers
For example, to copy data from an Amazon S3 bucket to an ADLS store, you could use the following command:
hadoop distcp -numListstatusThreads 40 s3a://hwdev-example-ireland/datasets adl://alice.azuredatalakestore.net/datasets
Irrespective of source and destination store locations, when copying data with DistCp, all data passes through the Hadoop cluster: once to read, once to write. This means that the time to perform the copy depends on the size of the Hadoop cluster, and the bandwidth between it and the object stores. The operations may also incur costs for the data downloaded.
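If a copy competes with other workloads for cluster or network capacity, you can tune it. As a sketch, the -m option limits the number of map tasks used for the copy, and the -bandwidth option caps the bandwidth, in MB per second, consumed by each map:

hadoop distcp -m 20 -bandwidth 50 s3a://hwdev-example-ireland/datasets adl://alice.azuredatalakestore.net/datasets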
Copying Data Within a Cloud Storage Container
Copy operations within a single object store still take place in the Hadoop cluster, even when the object store implements a more efficient copy operation internally. That is, an operation such as
hadoop distcp -numListstatusThreads 40 s3a://bucket/datasets/set1 s3a://bucket/datasets/set2
copies each byte down to the Hadoop worker nodes and back up to the bucket. In addition to being slow, the operation may incur charges.
Note: When using DistCp to copy data in S3, even within the same bucket, the current encryption settings of the client are used, irrespective of the encryption settings of the source data.
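If the copied objects must end up with a particular server-side encryption setting, you can set it explicitly for the copy. As a sketch, assuming SSE-S3 (AES256) is wanted on the destination objects:

hadoop distcp -D fs.s3a.server-side-encryption-algorithm=AES256 s3a://bucket/datasets/set1 s3a://bucket/datasets/set2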