Cloud Data Access

Improving Performance for DistCp

ADLS and WASB

You can tune the self-throttling properties fs.azure.selfthrottling.read.factor and fs.azure.selfthrottling.write.factor. For details, refer to the blog post "Maximizing HDInsight throughput to Azure Blob Storage".
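As a rough sketch, these two properties can be passed to DistCp with -D in the same way as other Hadoop configuration options. The source path, the container and account names, and the factor values below are hypothetical placeholders, not recommendations. The factors are expected to lie between 0 and 1, with lower values throttling more, but verify this against your Hadoop version. The script only assembles and prints the command so that it can be reviewed first.

```shell
#!/bin/sh
# Hypothetical source and destination; substitute your own paths.
SRC="hdfs:///data/source"
DEST="wasb://mycontainer@myaccount.blob.core.windows.net/data/dest"

# Illustrative values only (assumption: valid range is 0 to 1, lower = more throttling).
READ_FACTOR="1.0"
WRITE_FACTOR="1.0"

# Assemble the DistCp invocation with both self-throttling factors set.
DISTCP_CMD="hadoop distcp -D fs.azure.selfthrottling.read.factor=${READ_FACTOR} -D fs.azure.selfthrottling.write.factor=${WRITE_FACTOR} ${SRC} ${DEST}"

# Print the command for review; to run it, paste the output into a shell.
echo "$DISTCP_CMD"
```

Keeping the factors in variables makes it easier to repeat the copy with different values and compare throughput.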

Amazon S3

If you are planning to copy large amounts of data between HDFS and S3, you can accelerate the process by passing -D fs.s3a.fast.upload=true while invoking DistCp. For example:

hadoop distcp -D fs.s3a.fast.upload=true s3a://dominika-test/driver-data /tmp/test2

The fs.s3a.fast.upload option significantly accelerates data upload by writing the data in blocks, possibly in parallel.
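Because the option affects writes to S3, its effect applies when S3 is the destination of the copy. The following is a minimal sketch of such an upload, assuming a hypothetical HDFS source path and a hypothetical bucket name. It also sets fs.s3a.fast.upload.buffer, which selects where blocks are buffered before upload (disk, array, or bytebuffer). Whether that property is available depends on your Hadoop version, so treat it as an assumption to verify. The script only assembles and prints the command.

```shell
#!/bin/sh
# Hypothetical source and destination; substitute your own paths and bucket.
SRC="hdfs:///tmp/test2"
DEST="s3a://example-bucket/driver-data"

# Assumed buffering mode; "disk" buffers blocks on local disk before upload.
UPLOAD_BUFFER="disk"

# Assemble the DistCp invocation with fast upload enabled.
DISTCP_CMD="hadoop distcp -D fs.s3a.fast.upload=true -D fs.s3a.fast.upload.buffer=${UPLOAD_BUFFER} ${SRC} ${DEST}"

# Print the command for review; to run it, paste the output into a shell.
echo "$DISTCP_CMD"
```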

For more tips on how to improve performance for DistCp with S3, refer to Configuring and Tuning S3A Fast Upload.