Copying Data Between Two Clusters Using DistCp and HFTP
You can use the DistCp tool on the CDH 5 cluster to initiate the copy job. To copy data between two clusters running different versions of CDH, run DistCp with hftp:// as the source file system and hdfs:// as the destination file system. This uses the HFTP protocol for the source and the HDFS protocol for the destination. The default port for HFTP is 50070, and the default port for HDFS is 8020.
Example of a source URI: hftp://namenode-location:50070/basePath
where namenode-location refers to the CDH 4 NameNode hostname as defined by its configured fs.default.name and 50070 is the NameNode's HTTP server port, as defined by the configured dfs.http.address.
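If you are not sure which values the CDH 4 cluster uses, you can query its configuration with the hdfs getconf command (a minimal sketch, assuming the hdfs client is available on a CDH 4 node and these properties are set in its client configuration):
$ hdfs getconf -confKey fs.default.name
$ hdfs getconf -confKey dfs.http.address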
Example of a destination URI: hdfs://nameservice-id/basePath or hdfs://namenode-location
This refers to the CDH 5 NameNode as defined by its configured fs.defaultFS.
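Similarly, on the CDH 5 cluster you can confirm the configured default file system (again a sketch, assuming the hdfs client is available on a CDH 5 node):
$ hdfs getconf -confKey fs.defaultFS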
The basePath in both of the above URIs is optional; specify it if you want to copy only a particular directory rather than the entire file system.
The DistCp Command
To see the built-in help, including all of the options available for the DistCp tool, run the following command:
$ hadoop distcp
To copy the entire source file system to the destination cluster, run a command such as the following:
$ hadoop distcp hftp://cdh4-namenode:50070/ hdfs://CDH5-nameservice/
Or copy a specific path, such as /hbase to move HBase data:
$ hadoop distcp hftp://cdh4-namenode:50070/hbase hdfs://CDH5-nameservice/hbase
DistCp will then submit a regular MapReduce job that performs a file-by-file copy.
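If a copy is interrupted or needs to be repeated, you can re-run DistCp with the -update option so that only missing or changed files are copied again, and use -m to limit the number of simultaneous map tasks. For example (a sketch reusing the hypothetical host and path names from above; adjust the map count for your environment):
$ hadoop distcp -update -m 20 hftp://cdh4-namenode:50070/hbase hdfs://CDH5-nameservice/hbase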