Copying Data Between Two Clusters Using distcp
You can use the distcp tool on the destination cluster to initiate the copy job that moves the data. Between two clusters running different versions of CDH, run the distcp tool with hftp:// as the source file system and hdfs:// as the destination file system. This uses the HFTP protocol for the source and the HDFS protocol for the destination; because HFTP is a read-only protocol, the command must run on the destination cluster. The default port for HFTP is 50070, and the default port for HDFS is 8020. Amazon S3 is also supported, through the s3:// (block), s3n:// (native), and s3a:// protocols.
Example of a source URI: hftp://namenode-location:50070/basePath
where namenode-location refers to the CDH 4 NameNode hostname as defined by its configured fs.default.name and 50070 is the NameNode's HTTP server port, as defined by the configured dfs.http.address.
Example of a destination URI: hdfs://nameservice-id/basePath or hdfs://namenode-location
This refers to the CDH 5 NameNode as defined by its configured fs.defaultFS.
The basePath in both of the above URIs refers to the specific directory you want to copy; include it only if you do not want to copy from the root of the filesystem.
Example of an Amazon S3 Block Filesystem URI: s3://accessKeyId:secretKey@bucket/file
Example of an Amazon S3 Native Filesystem URI: s3n://accessKeyId:secretKey@bucket/file
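Putting the pieces together, a copy between clusters on different CDH versions might look like the following sketch; namenode-location, nameservice-id, and basePath are the placeholders from the examples above, not real hosts or paths:
$ hadoop distcp hftp://namenode-location:50070/basePath hdfs://nameservice-id/basePath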
The distcp Command
You can use distcp to copy files between compatible clusters in either direction; depending on the protocols involved, the command can be issued from either the source cluster or the destination cluster (see the table below).
For example, when upgrading from a CDH 5.7 cluster to a CDH 5.9 cluster, run distcp from the CDH 5.9 cluster in this manner:
$ hadoop distcp hftp://cdh57-namenode:50070/ hdfs://CDH59-nameservice/
$ hadoop distcp s3a://bucket/ hdfs://CDH59-nameservice/
You can also use a specific path, such as /hbase to move HBase data, for example:
$ hadoop distcp hftp://cdh57-namenode:50070/hbase hdfs://CDH59-nameservice/hbase
$ hadoop distcp s3a://bucket/file hdfs://CDH59-nameservice/bucket/file
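distcp also accepts job-level options; -update (copy only files that are missing or changed at the destination), -overwrite, and -m (cap the number of simultaneous map tasks) are commonly used. A sketch reusing the hosts from the examples above:
$ hadoop distcp -update -m 20 hftp://cdh57-namenode:50070/hbase hdfs://CDH59-nameservice/hbase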
Protocol Support for distcp
The following table lists support for using different protocols with the distcp command on different versions of CDH. In the table, secure means that the cluster is configured to use Kerberos. Copying between a secure cluster and an insecure cluster is only supported from CDH 5.1.3 onward, due to the inclusion of HDFS-6776.
Source | Destination | Source protocol and configuration | Destination protocol and configuration | Where to issue distcp command | Fallback Configuration Required | Status
---|---|---|---|---|---|---
CDH 4 | CDH 4 | hdfs or webhdfs, insecure | hdfs or webhdfs, insecure | Source or destination | No | ok
CDH 4 | CDH 4 | hftp, insecure | hdfs or webhdfs, insecure | Destination | No | ok
CDH 4 | CDH 4 | hdfs or webhdfs, secure | hdfs or webhdfs, secure | Source or destination | No | ok
CDH 4 | CDH 4 | hftp, secure | hdfs or webhdfs, secure | Destination | No | ok
CDH 4 | CDH 5 | webhdfs, insecure | webhdfs or hdfs, insecure | Destination | No | ok
CDH 4 | CDH 5 (5.1.3 and newer) | webhdfs, insecure | webhdfs, secure | Destination | Yes | ok
CDH 4 | CDH 5 | webhdfs or hftp, insecure | webhdfs or hdfs, insecure | Destination | No | ok
CDH 4 | CDH 5 | webhdfs or hftp, secure | webhdfs or hdfs, secure | Destination | No | ok
CDH 4 | CDH 5 | hdfs or webhdfs, insecure | webhdfs, insecure | Source | No | ok
CDH 5 | CDH 4 | webhdfs, insecure | webhdfs, insecure | Source or destination | No | ok
CDH 5 | CDH 4 | webhdfs, insecure | hdfs, insecure | Destination | No | ok
CDH 5 | CDH 4 | hdfs, insecure | webhdfs, insecure | Source | No | ok
CDH 5 | CDH 4 | hftp, insecure | hdfs or webhdfs, insecure | Destination | No | ok
CDH 5 | CDH 4 | webhdfs, secure | webhdfs, secure | Source or destination | No | ok
CDH 5 | CDH 4 | webhdfs, secure | hdfs, insecure | Destination | No | ok
CDH 5 | CDH 4 | hdfs, secure | webhdfs, secure | Source | No | ok
CDH 5 | CDH 5 | hdfs or webhdfs, insecure | hdfs or webhdfs, insecure | Source or destination | No | ok
CDH 5 | CDH 5 | hftp, insecure | hdfs or webhdfs, insecure | Destination | No | ok
CDH 5 | CDH 5 | hdfs or webhdfs, secure | hdfs or webhdfs, secure | Source or destination | No | ok
CDH 5 | CDH 5 | hftp, secure | hdfs or webhdfs, secure | Destination | No | ok
CDH 5 | CDH 5 | hdfs or webhdfs, secure | webhdfs, insecure | Source | Yes | ok
CDH 5 | CDH 5 | hdfs or webhdfs, insecure | hdfs or webhdfs, secure | Destination | Yes | ok
Where the table indicates that fallback configuration is required, add the following property to core-site.xml on the cluster from which you issue the distcp command; it allows a secure client to fall back to simple authentication when the other cluster is insecure:
<property>
  <name>ipc.client.fallback-to-simple-auth-allowed</name>
  <value>true</value>
</property>
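Because this is an ordinary Hadoop configuration property, it can also usually be passed for a single job through the generic -D option instead of being edited into core-site.xml; a sketch with placeholder hostnames, not taken from the examples above:
$ hadoop distcp -D ipc.client.fallback-to-simple-auth-allowed=true webhdfs://insecure-namenode:50070/ hdfs://secure-nameservice/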
distcp between Secure Clusters in Distinct Kerberos Realms
Specify the Destination Parameters in krb5.conf
[realms]
 HADOOP.QA.domain.COM = {
  kdc = kdc.domain.com:88
  admin_server = admin.test.com:749
  default_domain = domain.com
  supported_enctypes = arcfour-hmac:normal des-cbc-crc:normal des-cbc-md5:normal des:normal des:v4 des:norealm des:onlyrealm des:afs3
 }

[domain_realm]
 .domain.com = HADOOP.test.domain.COM
 domain.com = HADOOP.test.domain.COM
 test03.domain.com = HADOOP.QA.domain.COM
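Before launching the job, it can help to confirm that the client actually obtains a ticket in the relevant realm using the standard Kerberos tools; alice here is the example principal from the distcp command shown later, and the realm is the one defined above:
$ kinit alice@HADOOP.QA.domain.COM
$ klist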
(If SSL is enabled) Specify Truststore Properties
Configure the following truststore properties in ssl-client.xml on the client that submits the distcp job, so that it trusts the certificates presented by both clusters:
<property>
  <name>ssl.client.truststore.location</name>
  <value>path_to_truststore</value>
</property>
<property>
  <name>ssl.client.truststore.password</name>
  <value>XXXXXX</value>
</property>
<property>
  <name>ssl.client.truststore.type</name>
  <value>jks</value>
</property>
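To check that the truststore referenced above is readable and contains the expected CA certificates, the JDK's keytool can list its contents; path_to_truststore is the placeholder from the snippet, and keytool prompts for the password:
$ keytool -list -keystore path_to_truststore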
Set HADOOP_CONF to the Destination Cluster
Set the HADOOP_CONF path to point at the destination cluster's configuration. If you are not using HFTP, set the HADOOP_CONF path to the source cluster's configuration instead.
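In practice this is typically done through the HADOOP_CONF_DIR environment variable; a minimal sketch, assuming the destination cluster's client configuration has been staged under /etc/hadoop/conf.dest (a hypothetical path):
$ export HADOOP_CONF_DIR=/etc/hadoop/conf.dest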
Launch distcp
$ hadoop distcp hdfs://test01.domain.com:8020/user/alice hdfs://test02.domain.com:8020/user/alice
If launching distcp fails with Kerberos errors, force Kerberos to use TCP instead of UDP by adding the following parameter to the krb5.conf file on the client:
[libdefaults]
 udp_preference_limit = 1
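As a quick cross-realm sanity check, a plain directory listing against the remote cluster exercises the same authentication path as distcp without starting a copy; the host and path are the ones from the example command above:
$ hadoop fs -ls hdfs://test02.domain.com:8020/user/alice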