Cloud Data Access

Copying Data with DistCp

You can use DistCp to copy data between your cluster's HDFS and your cloud storage. DistCp is a utility for copying large data sets between distributed filesystems. To access the DistCp utility, SSH to any node in your cluster.

Copying Data from HDFS to Cloud Storage

To transfer data from HDFS to an Amazon S3 bucket, list the path to HDFS first and the path to the cloud storage second:

hadoop distcp hdfs://source-folder s3a://destination-bucket

Updating Existing Data

If you would like to transfer only the files that don't already exist in the target folder, add the -update option to speed up the copy:

hadoop distcp -update hdfs://source-folder s3a://destination-bucket

When copying between Amazon S3 and HDFS, the -update check compares only file size; it does not use checksums to detect other changes in the data.
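To keep a destination bucket fully in sync with the source, -update can be combined with DistCp's -delete option, which removes files from the destination that no longer exist at the source. A minimal sketch, reusing the placeholder paths from the example above:

```shell
# Mirror an HDFS folder into S3: copy new or changed files, and
# delete destination files that are absent from the source.
# (source-folder and destination-bucket are placeholder names.)
hadoop distcp -update -delete hdfs://source-folder s3a://destination-bucket
```

Use -delete with care: files present only in the destination are removed permanently.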

Copying Data from Cloud Storage to HDFS

To copy data from your cloud storage container to HDFS, list the path of the cloud storage data first and the path to HDFS second. For example:

hadoop distcp s3a://hwdev-examples-ireland/datasets /tmp/datasets2

This downloads all files.

You can add the -update option to download only data that has changed:

hadoop distcp -update s3a://hwdev-examples-ireland/datasets /tmp/datasets2
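When only a subset of paths needs to be copied, DistCp can read a list of source URIs from a file passed with the -f option instead of taking a single source path. A sketch, assuming a list file staged on HDFS (the path and bucket contents are illustrative):

```shell
# srclist contains one source URI per line, for example:
#   s3a://hwdev-examples-ireland/datasets/set1
#   s3a://hwdev-examples-ireland/datasets/set2
hadoop distcp -update -f hdfs:///tmp/srclist /tmp/datasets2
```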

Copying Data Between Cloud Storage Containers

You can copy data between cloud storage containers simply by listing the different URLs as the source and destination paths. This includes copying:

  • Between two Amazon S3 buckets

  • Between two ADLS containers

  • Between two WASB containers

  • Between ADLS and WASB containers

For example, to copy data from one Amazon S3 bucket to another, use the following syntax:

hadoop distcp s3a://hwdev-example-ireland/datasets s3a://hwdev-example-us/datasets

Irrespective of source and destination bucket locations, when copying data between Amazon S3 buckets, all data passes through the Hadoop cluster: once to read, once to write. This means that the time to perform the copy depends on the size of the Hadoop cluster, and the bandwidth between it and the S3 buckets. Furthermore, even when running within Amazon's own infrastructure, you are billed for your accesses to remote Amazon S3 buckets.
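Because the copy runs through the cluster, its footprint can be tuned with DistCp's -m option (maximum number of simultaneous copy maps) and -bandwidth option (per-map throttle in MB/s). A sketch using the bucket names from the example above:

```shell
# Run at most 20 concurrent copy maps, each throttled to 50 MB/s,
# so the cross-region copy does not saturate the cluster's network.
hadoop distcp -m 20 -bandwidth 50 \
  s3a://hwdev-example-ireland/datasets s3a://hwdev-example-us/datasets
```

Increasing -m shortens the copy on a large cluster; lowering -bandwidth protects other workloads sharing the same links.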

Copying Data Within a Cloud Storage Container

Copy operations within a single object store still take place in the Hadoop cluster, even when the object store implements a more efficient copy operation internally. That is, an operation such as

hadoop distcp s3a://bucket/datasets/set1 s3a://bucket/datasets/set2

copies each byte down to the Hadoop worker nodes and back to the bucket. In addition to the operation being slow, it means that charges may be incurred.

Specifying Per-Bucket DistCp Options for S3

If a bucket requires different authentication or endpoint settings, you can declare them with bucket-specific options. For example, copying to a remote bucket that uses Amazon's V4 authentication API requires declaring that bucket's S3 endpoint explicitly:

hadoop distcp \
  -D fs.s3a.bucket.hwdev-example-frankfurt.endpoint=s3.eu-central-1.amazonaws.com \
  s3a://hwdev-example-us/datasets/set1 \
  s3a://hwdev-example-frankfurt/datasets/

Similarly, different credentials may be used when copying between buckets belonging to different accounts. When performing such an operation, remember that secrets passed on the command line can be visible to other users on the system, and are therefore potentially insecure.

hadoop distcp \
  -D fs.s3a.bucket.hwdev-example-frankfurt.endpoint=s3.eu-central-1.amazonaws.com \
  -D fs.s3a.bucket.hwdev-example-frankfurt.access.key=AKAACCESSKEY-2 \
  -D fs.s3a.bucket.hwdev-example-frankfurt.secret.key=SECRETKEY \
  s3a://hwdev-example-us/datasets/set1 \
  s3a://hwdev-example-frankfurt/datasets/

Using short-lived session keys reduces the exposure, while storing the secrets in Hadoop JCEKS credential files is potentially significantly more secure.
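As a sketch of the credential-file approach, the access and secret keys can be stored in a JCEKS keystore with the hadoop credential command and then referenced through hadoop.security.credential.provider.path, so no secret appears on the DistCp command line (the keystore path below is illustrative):

```shell
# Store the S3A credentials in a JCEKS keystore on HDFS.
# (The keystore path and key values are placeholders.)
hadoop credential create fs.s3a.access.key -value AKAACCESSKEY-2 \
  -provider jceks://hdfs/user/admin/s3.jceks
hadoop credential create fs.s3a.secret.key -value SECRETKEY \
  -provider jceks://hdfs/user/admin/s3.jceks

# Reference the keystore instead of passing secrets on the command line.
hadoop distcp \
  -D hadoop.security.credential.provider.path=jceks://hdfs/user/admin/s3.jceks \
  s3a://hwdev-example-us/datasets/set1 s3a://hwdev-example-frankfurt/datasets/
```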

Related Links

Improving Performance for DistCp

Local Space Requirements for Copying to S3

Limitations When Using DistCp with S3

Apache DistCp documentation