Using the DistCp tool
Use DistCp to copy files between various clusters.
The most common use of DistCp is an inter-cluster copy:
hadoop distcp hdfs://nn1:8020/source
hdfs://nn2:8020/destination
Where hdfs://nn1:8020/source
is the data source, and
hdfs://nn2:8020/ destination
is the destination.
This will expand the name space under /source on NameNode "nn1" into a temporary file, partition its contents among a set of map tasks, and start copying from "nn1" to "nn2". Note that DistCp requires absolute paths.
You can also specify multiple source directories:
hadoop distcp hdfs://nn1:8020/source/a hdfs://nn1:8020/source/b hdfs://
nn2:8020/destination
Or specify multiple source directories from a file with the -f option:
hadoop distcp -f hdfs://nn1:8020/srclist hdfs://nn2:8020/destination
Where srclist contains:
hdfs://nn1:8020/source/a
hdfs://nn1:8020/source/b