Using DistCp
Use DistCp to copy files between clusters.
The most common use of DistCp is an inter-cluster copy:
hadoop distcp hdfs://nn1:8020/source hdfs://nn2:8020/destination
Where hdfs://nn1:8020/source is the data source, and hdfs://nn2:8020/destination is the destination. This will expand the namespace under /source on NameNode "nn1" into a temporary file, partition its contents among a set of map tasks, and start copying from "nn1" to "nn2". Note that DistCp requires absolute paths.
You can also specify multiple source directories:
hadoop distcp hdfs://nn1:8020/source/a hdfs://nn1:8020/source/b hdfs://nn2:8020/destination
Or specify multiple source directories from a file with the -f option:
hadoop distcp -f hdfs://nn1:8020/srclist hdfs://nn2:8020/destination
Where srclist contains:
hdfs://nn1:8020/source/a
hdfs://nn1:8020/source/b
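As a minimal sketch of how such a source list could be produced, the helper below prints one fully qualified URI per line. The function name make_srclist and the local file name srclist are illustrative assumptions, not part of DistCp; the hadoop commands in the comments require a Hadoop client and are shown for context only.

```shell
# Hypothetical helper: print one source URI per line for use with -f.
# Arguments: a base URI, then one or more directory names under it.
make_srclist() {
  local base="${1%/}"   # drop a single trailing slash, if present
  shift
  local dir
  for dir in "$@"; do
    printf '%s/%s\n' "$base" "$dir"
  done
}

# Possible usage (not run here):
#   make_srclist hdfs://nn1:8020/source a b > srclist
#   hadoop fs -put srclist hdfs://nn1:8020/srclist
#   hadoop distcp -f hdfs://nn1:8020/srclist hdfs://nn2:8020/destination
```

Keeping one URI per line makes the list easy to review and to generate from other tooling.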
DistCp with HFTP
After a copy, generate and cross-check a listing of the source and destination to verify that the copy succeeded. Because DistCp relies on both MapReduce and the FileSystem API, issues in or between any of the three (DistCp, MapReduce, or the FileSystem API) could adversely and silently affect the copy. Some users have had success running a second pass with -update enabled, but you should understand its semantics before attempting this.
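One way to cross-check the listings is sketched below. It reduces the output of hadoop fs -ls -R to sorted "relative-path size" lines so that the two sides can be compared with diff. The function name normalize_listing is hypothetical, and the sketch assumes the usual eight-column listing format (permissions, replication, owner, group, size, date, time, path) and paths without spaces. It compares names and sizes only, not file contents.

```shell
# Hypothetical helper: read `hadoop fs -ls -R` output on stdin and print
# "relative-path size" for each regular file, sorted by path.
# Argument: the root prefix to strip, exactly as it appears in the listing.
normalize_listing() {
  local root="$1"
  awk -v root="$root" '
    $1 ~ /^-/ {                        # regular files only; directories start with "d"
      path = $8
      if (index(path, root) == 1)      # strip the root prefix when present
        path = substr(path, length(root) + 1)
      print path, $5                   # $5 is the file size in bytes
    }' | sort
}

# Possible usage (not run here):
#   hadoop fs -ls -R hdfs://nn1:8020/source \
#     | normalize_listing hdfs://nn1:8020/source > src.txt
#   hadoop fs -ls -R hdfs://nn2:8020/destination \
#     | normalize_listing hdfs://nn2:8020/destination > dst.txt
#   diff src.txt dst.txt && echo "listings match"
```

Any line reported by diff points to a file that is missing on one side or whose size differs.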
Note also that if another client is still writing to a source file, the copy will likely fail. Attempting to overwrite a file that is being written at the destination should also fail on HDFS. If a source file is moved or removed before it is copied, the copy will fail with a FileNotFoundException.