Accessing Cloud Data

Improving DistCp Performance

This section includes tips for improving performance when copying large volumes of data between Amazon S3 and HDFS.

The bandwidth between the Hadoop cluster and the object store is usually the upper limit on how fast data can be copied into S3. The farther the Hadoop cluster is from the object store, or the narrower the network connection, the longer the operation takes. Even a Hadoop cluster deployed within cloud infrastructure may encounter network delays from throttled VM network connections.
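
To make the bandwidth ceiling concrete, the following minimal sketch estimates a lower bound on copy time from data size and link bandwidth. The function name `min_copy_seconds` and the figures used (1 TB over a 1 Gbit/s link) are illustrative assumptions, not values from this guide; real transfers will be slower because of protocol overhead, throttling, and request latency.

```python
def min_copy_seconds(data_bytes: int, bandwidth_bits_per_sec: float) -> float:
    """Return a lower bound on transfer time in seconds.

    Assumes the network link is the only bottleneck and is fully used.
    """
    if bandwidth_bits_per_sec <= 0:
        raise ValueError("bandwidth must be positive")
    # Convert bytes to bits, then divide by the link speed.
    return data_bytes * 8 / bandwidth_bits_per_sec


# Hypothetical figures: 1 TB (10**12 bytes) over a 1 Gbit/s link.
seconds = min_copy_seconds(10**12, 10**9)
print(f"{seconds / 3600:.1f} hours")  # prints "2.2 hours"
```

No DistCp tuning can bring the copy below this bound; tuning only helps the job get closer to it.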

Network bandwidth limits notwithstanding, there are some options which can be used to tune the performance of an upload: