DistCp additional considerations

DistCp provides a strategy to “dynamically” size maps, allowing faster DataNodes to copy more bytes than slower nodes.

Map Sizing

By default, DistCp attempts to size each map comparably, so that each copies roughly the same number of bytes. Note that files are the finest level of granularity; increasing the number of simultaneous copiers (i.e. maps) may not always increase the number of simultaneous copies or the overall throughput.
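For example, the default uniform-size behaviour can be combined with an explicit cap on the number of maps via the -m option. The NameNode hosts, ports, and paths below are placeholders, not values from this document:

    # Use at most 20 maps; each map is assigned roughly the same number of bytes
    hadoop distcp -m 20 hdfs://nn1:8020/source/dir hdfs://nn2:8020/target/dir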

Using the dynamic strategy (explained in the Architecture), rather than assigning a fixed set of source files to each map task, files are instead split into several sets, or “chunks”. The number of chunks exceeds the number of maps, usually by a factor of 2-3. Each map picks up and copies all files listed in a chunk. When a chunk is exhausted, a new chunk is acquired and processed, until no more chunks remain.

By not assigning a source path to a fixed map, faster map tasks (i.e. DataNodes) are able to consume more chunks -- and thus copy more data -- than slower nodes. While this distribution is not uniform, it is fair with regard to each mapper’s capacity.

The dynamic strategy provides superior performance under most conditions.
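As a brief sketch (host names, ports, and paths are illustrative), the dynamic strategy is selected with the -strategy option:

    # Split the copy list into chunks; faster maps consume more chunks
    hadoop distcp -strategy dynamic hdfs://nn1:8020/source/dir hdfs://nn2:8020/target/dir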

For long-running and regularly run jobs, it is recommended to tune the number of maps to the size of the source and destination clusters, the size of the copy, and the available bandwidth.
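For instance, a regularly scheduled job might combine an explicit map count with a per-map bandwidth cap. The values below are illustrative assumptions, not recommendations:

    # 40 maps, each throttled to roughly 50 MB/s, using the dynamic strategy
    hadoop distcp -strategy dynamic -m 40 -bandwidth 50 \
        hdfs://nn1:8020/source/dir hdfs://nn2:8020/target/dir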

Copying Between Major Versions of HDFS

For copying between two different major versions of Hadoop, you can usually use WebHDFS: the webhdfs:// protocol is REST-based and remains compatible across HDFS versions, unlike direct hdfs:// RPC access.
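A minimal sketch, assuming the source NameNode exposes WebHDFS on its HTTP port (typically 50070 on older releases, 9870 on Hadoop 3); hosts and paths are placeholders:

    # Read the source over WebHDFS (version-independent REST protocol),
    # write to the destination cluster's HDFS
    hadoop distcp webhdfs://source-nn:9870/source/dir hdfs://dest-nn:8020/target/dir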

MapReduce and Other Side-Effects

As mentioned previously, should a map fail to copy one of its inputs, there will be several side-effects, listed below; a command sketch showing the relevant options follows the list.

  • Unless -overwrite is specified, files successfully copied by a previous map will be marked as “skipped” on a re-execution.

  • If a map fails mapreduce.map.maxattempts times, the remaining map tasks will be killed (unless -i is set).

  • If mapreduce.map.speculative is set final and true, the result of the copy is undefined.
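A hedged example of controlling these behaviours (the property values and paths are illustrative): -i ignores individual copy failures, -update re-copies only changed files on a rerun, and the attempt and speculation settings are passed as generic -D options before the DistCp arguments:

    # Ignore individual copy failures, disable speculative map execution,
    # and allow up to 4 attempts per map before the job fails
    hadoop distcp -D mapreduce.map.maxattempts=4 \
        -D mapreduce.map.speculative=false \
        -update -i \
        hdfs://nn1:8020/source/dir hdfs://nn2:8020/target/dir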