Accessing Cloud Data

Specifying Per-Bucket DistCp Options for S3 Buckets

If a source or destination bucket requires different authentication or endpoint options, you can set those options for that bucket alone using bucket-specific configuration properties. For example, copying to a remote bucket that uses Amazon's V4 authentication API requires the explicit S3 endpoint for that bucket to be declared:

hadoop distcp \
  -D fs.s3a.bucket.hwdev-example-frankfurt.endpoint=s3.eu-central-1.amazonaws.com \
  -update -skipcrccheck -numListstatusThreads 40 \
  s3a://hwdev-example-us/datasets/set1 \
  s3a://hwdev-example-frankfurt/datasets/

Similarly, different credentials can be used when copying between buckets belonging to different accounts. When performing such an operation, keep in mind that secrets passed on the command line are visible to other users on the system, so this approach is potentially insecure.

hadoop distcp \
  -D fs.s3a.bucket.hwdev-example-frankfurt.endpoint=s3.eu-central-1.amazonaws.com \
  -D fs.s3a.bucket.hwdev-example-frankfurt.access.key=AKAACCESSKEY-2 \
  -D fs.s3a.bucket.hwdev-example-frankfurt.secret.key=SECRETKEY \
  -update -skipcrccheck -numListstatusThreads 40 \
  s3a://hwdev-example-us/datasets/set1 s3a://hwdev-example-frankfurt/datasets/     

Using short-lived session keys can reduce this exposure, while storing the secrets in Hadoop JCEKS credential files keeps them off the command line entirely and is significantly more secure.
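
As a rough sketch of the credential-file approach, the destination bucket's secrets could be stored in a JCEKS file on HDFS with the hadoop credential command and then referenced at submission time through hadoop.security.credential.provider.path. The HDFS path, NameNode host, and key values below are illustrative placeholders, not part of the example above.

# Store the destination bucket's credentials in a JCEKS file on HDFS
# (the provider path and key values are placeholders).
hadoop credential create fs.s3a.bucket.hwdev-example-frankfurt.access.key \
  -value AKAACCESSKEY-2 \
  -provider jceks://hdfs@nn.example.com/user/admin/s3.jceks

hadoop credential create fs.s3a.bucket.hwdev-example-frankfurt.secret.key \
  -value SECRETKEY \
  -provider jceks://hdfs@nn.example.com/user/admin/s3.jceks

# Reference the credential file instead of passing the secrets on the command line.
hadoop distcp \
  -D hadoop.security.credential.provider.path=jceks://hdfs@nn.example.com/user/admin/s3.jceks \
  -D fs.s3a.bucket.hwdev-example-frankfurt.endpoint=s3.eu-central-1.amazonaws.com \
  -update -skipcrccheck -numListstatusThreads 40 \
  s3a://hwdev-example-us/datasets/set1 s3a://hwdev-example-frankfurt/datasets/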