Accessing Cloud Data

Controlling the Number of Mappers and Their Bandwidth

If you want to control the number of mappers launched for DistCp, you can add the -m option and set it to the desired number of mappers.
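As a minimal sketch, the following Python snippet assembles a DistCp command line with an explicit mapper count. The helper name build_distcp_command and the source and destination URLs are illustrative assumptions, not part of DistCp; only the -m option comes from the text above.

```python
import shlex


def build_distcp_command(source, dest, mappers):
    """Return the argument list for a DistCp run limited to `mappers` map tasks.

    Hypothetical helper for illustration: it only assembles the command
    and does not launch anything.
    """
    if mappers < 1:
        raise ValueError("mapper count must be at least 1")
    return ["hadoop", "distcp", "-m", str(mappers), source, dest]


# Placeholder paths: substitute your own cluster and bucket.
cmd = build_distcp_command(
    "hdfs://namenode:8020/data/logs",
    "s3a://example-bucket/logs",
    20,
)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run on a host where the hadoop command is available.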

When using DistCp from a Hadoop cluster running in cloud infrastructure, increasing the number of mappers may speed up the operation, as well as increase the likelihood that some of the source data will be held on the hosts running the mappers.

However, if you are copying to a remote cluster in a different region, the bandwidth from the Hadoop cluster to Amazon S3 may be the bottleneck. In that situation, because the bandwidth is shared across all mappers, adding more mappers will not accelerate the upload: it will merely slow all the mappers down.

The -bandwidth option sets the approximate maximum bandwidth for each mapper in megabytes per second. This is a floating point number, so a value such as -bandwidth 0.5 allocates 0.5 MB/s to each mapper.
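Because the limit applies per mapper, the approximate ceiling for the whole job is the mapper count multiplied by the per-mapper value. The sketch below works through that arithmetic; the function names and the 36,000 MB data size are assumptions chosen for illustration, and the result is only an upper bound that assumes every mapper runs concurrently at its cap.

```python
def aggregate_bandwidth_mb_s(mappers, per_mapper_mb_s):
    """Approximate upper bound on total throughput if every mapper runs at its cap."""
    return mappers * per_mapper_mb_s


def min_copy_seconds(total_mb, mappers, per_mapper_mb_s):
    """Lower bound on copy time, ignoring startup cost, data skew, and throttling."""
    return total_mb / aggregate_bandwidth_mb_s(mappers, per_mapper_mb_s)


# 20 mappers at -bandwidth 0.5 can move at most about 10 MB/s in total.
print(aggregate_bandwidth_mb_s(20, 0.5))   # 10.0
# At that ceiling, 36,000 MB needs at least an hour.
print(min_copy_seconds(36_000, 20, 0.5))   # 3600.0
```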

The Challenge of Store Throttling

Some cloud stores (especially S3) throttle IO operations to directory trees in their stores: the more load is placed on a directory tree, the more the caller is throttled. Requests are either delayed, or actually rejected with a "throttled" error code, after which the client is expected to wait before retrying the operation.
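The wait-then-retry behavior expected of a client can be sketched as exponential backoff with jitter. This is a generic pattern, not the implementation used by any particular Hadoop connector; ThrottledError, call_with_backoff, and the default delays are hypothetical names and values introduced for illustration.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a store's "throttled" response."""


def call_with_backoff(operation, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Run `operation`, waiting longer after each throttled attempt."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Double the base wait each time and add random jitter so that
            # many clients do not all retry at the same moment.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay))


# Example: an operation that is throttled twice before succeeding.
attempts = {"count": 0}


def flaky_upload():
    attempts["count"] += 1
    if attempts["count"] <= 2:
        raise ThrottledError("slow down")
    return "ok"


print(call_with_backoff(flaky_upload, base_delay=0.01))
```

The sleep parameter is injectable so the backoff can be exercised without real waiting.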

Large DistCp operations with many mappers can trigger this throttling, which slows down the upload. In such a situation, reducing the bandwidth of each mapper can reduce the load the DistCp operation places on the store, while still spreading the load around the cluster.

If adding more mappers to a DistCp operation appears to actually slow down the upload, throttling is a possible cause: consider reducing the mapper count.