This is the documentation for CDH 5.1.x. Documentation for other versions is available at Cloudera Documentation.

Copying Data Between Two Clusters Using DistCp and HFTP

You can use the DistCp tool on the CDH 5 cluster to initiate the copy job that moves the data. When copying between two clusters running different versions of CDH, run DistCp with hftp:// as the source file system and hdfs:// as the destination file system. This uses the HFTP protocol to read from the source and the HDFS protocol to write to the destination. The default port for HFTP is 50070, and the default port for HDFS is 8020.

Example of a source URI: hftp://namenode-location:50070/basePath

where namenode-location refers to the CDH 4 NameNode hostname, as defined by its configured fs.default.name, and 50070 is the NameNode's HTTP server port, as defined by the configured dfs.http.address.

Example of a destination URI: hdfs://nameservice-id/basePath or hdfs://namenode-location

This refers to the CDH 5 NameNode, as defined by its configured fs.defaultFS.

In both of the above URIs, basePath refers to the directory you want to copy, if you want to copy a specific directory rather than the entire file system.
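As an illustration only (this helper is not part of Hadoop or DistCp; the hostnames, ports, and paths are examples), the URI formats described above can be sketched as:

```python
# Hypothetical helpers illustrating the HFTP source and HDFS destination
# URI formats used with DistCp. All names here are example values.

def hftp_source_uri(namenode_host, http_port=50070, base_path=""):
    # HFTP source: uses the CDH 4 NameNode's HTTP server port,
    # as configured by dfs.http.address (default 50070).
    return "hftp://%s:%d/%s" % (namenode_host, http_port, base_path.lstrip("/"))

def hdfs_dest_uri(nameservice_or_host, base_path=""):
    # HDFS destination: uses the CDH 5 NameNode or nameservice ID,
    # as configured by fs.defaultFS (default RPC port 8020).
    return "hdfs://%s/%s" % (nameservice_or_host, base_path.lstrip("/"))

print(hftp_source_uri("cdh4-namenode", base_path="/hbase"))
# hftp://cdh4-namenode:50070/hbase
print(hdfs_dest_uri("CDH5-nameservice", base_path="/hbase"))
# hdfs://CDH5-nameservice/hbase
```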

The DistCp Command

To see the built-in help, including all the options available for the DistCp tool, run the following command:

$ hadoop distcp
Run the DistCp copy by issuing a command such as the following on the CDH 5 cluster:
  Important: Run the following DistCp commands on the destination cluster only; in this example, that is the CDH 5 cluster.
$ hadoop distcp hftp://cdh4-namenode:50070/ hdfs://CDH5-nameservice/

Or use a specific path, such as /hbase, to move HBase data; for example:

$ hadoop distcp hftp://cdh4-namenode:50070/hbase hdfs://CDH5-nameservice/hbase

DistCp will then submit a regular MapReduce job that performs a file-by-file copy.
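Because the copy runs as a MapReduce job, you can tune it with standard DistCp options. As a sketch (the hostnames and paths are illustrative, and these commands must be run against a live cluster), -m limits the number of map tasks and -update copies only files that are missing or changed at the destination:

```shell
# Limit the copy to 20 concurrent map tasks (illustrative values).
$ hadoop distcp -m 20 hftp://cdh4-namenode:50070/hbase hdfs://CDH5-nameservice/hbase

# Re-run the copy, transferring only files that differ from the destination.
$ hadoop distcp -update hftp://cdh4-namenode:50070/hbase hdfs://CDH5-nameservice/hbase
```

After the job completes, you can compare the source and destination (for example, with hadoop fs -count) to confirm that the expected files were copied.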

Page generated September 3, 2015.