DistCp Frequently Asked Questions

The latest version of DistCp differs in some respects from the legacy DistCp versions.

  • Why does -update not create the parent source directory under a pre-existing target directory? The behavior of -update and -overwrite is described in detail in the Using DistCp section of this document. In short, if either option is used with a pre-existing destination directory, the contents of each source directory are copied over, rather than the source directory itself. This behavior is consistent with the legacy DistCp implementation.

  • Why does DistCp not run faster when more maps are specified? By default, the smallest unit of work for DistCp is a file; that is, each file is processed by only one map. Increasing the number of maps beyond the number of files therefore yields no performance benefit: in that case, the number of maps launched equals the number of files. To speed up the transfer of very large files, use the -blocksperchunk option, which splits a file into chunks of the given number of blocks so that the chunks can be copied in parallel.

  • Why does DistCp run out of memory? If the number of individual files and directories being copied from the source path(s) is extremely large (e.g. 1,000,000 paths), DistCp might run out of memory while building the list of paths to copy. This is not unique to the new DistCp implementation. To get around this, consider raising the JVM heap size (the -Xmx parameter), as follows:

    bash$ export HADOOP_CLIENT_OPTS="-Xms64m -Xmx1024m"
    bash$ hadoop distcp /source /target
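
Returning to the question about map counts above, the arithmetic can be sketched in shell. This is only a rough illustration: the helper names effective_maps and chunks_per_file are hypothetical and not part of DistCp, and the model assumes one map per file by default and, with -blocksperchunk, roughly ceil(blocks / blocksperchunk) chunks per file. The actual split logic in DistCp may differ in detail.

```shell
#!/usr/bin/env bash
# Hypothetical helpers that illustrate the arithmetic; not part of DistCp.

# Maps that can do useful work when each file is handled by exactly one map:
# min(requested maps, number of files).
effective_maps() {
  local requested=$1 files=$2
  if [ "$requested" -lt "$files" ]; then
    echo "$requested"
  else
    echo "$files"
  fi
}

# Chunks produced for a single file, assuming ceil(blocks / blocks_per_chunk).
chunks_per_file() {
  local blocks=$1 blocks_per_chunk=$2
  echo $(( (blocks + blocks_per_chunk - 1) / blocks_per_chunk ))
}

effective_maps 100 8     # prints 8: only 8 files, so extra maps sit idle
chunks_per_file 1000 64  # prints 16: one large file becomes 16 copyable chunks
```

Under these assumptions, requesting 100 maps for 8 files gains nothing over requesting 8, whereas a command such as `hadoop distcp -m 100 -blocksperchunk 64 /source /target` would let a single 1000-block file be divided into about 16 chunks that separate maps can copy.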