Copying Data with DistCp
You can use DistCp to copy data between your cluster's HDFS and your cloud storage. DistCp is a utility for copying large data sets between distributed filesystems. To access the DistCp utility, SSH to any node in your cluster.
Copying Data from HDFS to Cloud Storage
To transfer data from HDFS to an Amazon S3 bucket, list the HDFS path first and the cloud storage path second:
hadoop distcp hdfs://source-folder s3a://destination-bucket
Updating Existing Data
If you would like to transfer only the files that don't already exist in the target folder, add the -update option to improve the copy speed:
hadoop distcp -update hdfs://source-folder s3a://destination-bucket
When copying between Amazon S3 and HDFS, the -update check compares only file size; it does not use checksums to detect other changes in the data.
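Because of this, a file whose contents changed but whose size stayed the same is not recopied by -update. If you need to force all files to be copied again, a minimal sketch using the standard DistCp -overwrite option (with the placeholder paths used above) is:
hadoop distcp -overwrite hdfs://source-folder s3a://destination-bucket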
Copying Data from Cloud Storage to HDFS
To copy data from your cloud storage container to HDFS, list the path of the cloud storage data first and the path to HDFS second. For example:
hadoop distcp s3a://hwdev-examples-ireland/datasets /tmp/datasets2
This command downloads all files. You can add the -update option to download only data that has changed:
hadoop distcp -update s3a://hwdev-examples-ireland/datasets /tmp/datasets2
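When the copy completes, you can list the destination directory to confirm that the expected files arrived; the path below is the example destination used above:
hadoop fs -ls /tmp/datasets2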
Copying Data Between Cloud Storage Containers
You can copy data between cloud storage containers simply by listing the different URLs as the source and destination paths. This includes copying:
Between two Amazon S3 buckets
Between two ADLS containers
Between two WASB containers
Between ADLS and WASB containers
For example, to copy data from one Amazon S3 bucket to another, use the following syntax:
hadoop distcp s3a://hwdev-example-ireland/datasets s3a://hwdev-example-us/datasets
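The same pattern applies to the Azure stores listed above. As a sketch with placeholder account and container names, copying from an ADLS store to a WASB container might look like:
hadoop distcp adl://youraccount.azuredatalakestore.net/datasets wasb://yourcontainer@youraccount.blob.core.windows.net/datasets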
Irrespective of source and destination bucket locations, when copying data between Amazon S3 buckets, all data passes through the Hadoop cluster: once to read, once to write. This means that the time to perform the copy depends on the size of the Hadoop cluster, and the bandwidth between it and the S3 buckets. Furthermore, even when running within Amazon's own infrastructure, you are billed for your accesses to remote Amazon S3 buckets.
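Because DistCp runs as a MapReduce job on the cluster, you can limit its impact or increase its throughput with the standard -m (maximum number of simultaneous copy maps) and -bandwidth (MB per second per map) options. A sketch, reusing the example bucket names above with values chosen purely for illustration:
hadoop distcp -m 20 -bandwidth 50 s3a://hwdev-example-ireland/datasets s3a://hwdev-example-us/datasets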
Copying Data Within a Cloud Storage Container
Copy operations within a single object store still take place in the Hadoop cluster, even when the object store implements a more efficient copy operation internally. That is, an operation such as
hadoop distcp s3a://bucket/datasets/set1 s3a://bucket/datasets/set2
copies each byte down to the Hadoop worker nodes and back up to the bucket. Besides being slow, this can also incur charges for the data transferred.
Specifying Per-Bucket DistCp Options for S3
If a bucket requires different authentication or endpoint settings, you can supply them as bucket-specific options. For example, copying to a remote bucket that uses Amazon's V4 authentication API requires the S3 endpoint to be declared explicitly:
hadoop distcp \
  -D fs.s3a.bucket.hwdev-example-frankfurt.endpoint=s3.eu-central-1.amazonaws.com \
  s3a://hwdev-example-us/datasets/set1 \
  s3a://hwdev-example-frankfurt/datasets/
Similarly, different credentials may be used when copying between buckets that belong to different accounts. When performing such an operation, keep in mind that secrets passed on the command line can be visible to other users on the system, and are therefore potentially insecure.
hadoop distcp \
  -D fs.s3a.bucket.hwdev-example-frankfurt.endpoint=s3.eu-central-1.amazonaws.com \
  -D fs.s3a.bucket.hwdev-example-frankfurt.access.key=AKAACCESSKEY-2 \
  -D fs.s3a.bucket.hwdev-example-frankfurt.secret.key=SECRETKEY \
  s3a://hwdev-example-us/datasets/set1 \
  s3a://hwdev-example-frankfurt/datasets/
Using short-lived session keys reduces the exposure here, while storing the secrets in Hadoop JCEKS credential files is potentially significantly more secure.
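As a sketch of the credential-file approach (the keystore path, key values, and bucket names are placeholders), you could create a JCEKS keystore with the hadoop credential command and point the job at it; omitting -value makes the command prompt for the secret instead of exposing it on the command line:
hadoop credential create fs.s3a.access.key -value AKAACCESSKEY-2 \
  -provider jceks://hdfs/user/admin/s3.jceks
hadoop credential create fs.s3a.secret.key -value SECRETKEY \
  -provider jceks://hdfs/user/admin/s3.jceks
hadoop distcp \
  -D hadoop.security.credential.provider.path=jceks://hdfs/user/admin/s3.jceks \
  s3a://hwdev-example-us/datasets/set1 \
  s3a://hwdev-example-frankfurt/datasets/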
Related Links
Improving Performance for DistCp
Local Space Requirements for Copying to S3