Requirements and Restrictions
- The CDH 5 cluster must have a MapReduce service running on it (MRv1 or YARN (MRv2)).
- All the MapReduce nodes in the CDH 5 cluster should have full network access to all the nodes of the source cluster. This allows you to perform the copy in a distributed manner.
- To copy data between a secure and an insecure cluster, you must run the distcp command on the secure cluster.
- To copy data from a CDH 4 to a CDH 5 cluster, you can do one of the following:
  Note: The term source in this case refers to the CDH 4 (or other Hadoop) cluster you want to migrate or copy data from, and destination refers to the CDH 5 cluster.
  - Run commands on the destination cluster, using the Hftp protocol for the source cluster and HDFS for the destination. (Hftp is read-only, so you must run DistCp on the destination cluster and pull the data from the source cluster.) See Copying Data between two Clusters Using DistCp and Hftp.
    Note: Do not use this method if one of the clusters is secure and the other is not.
  - Run commands on the source cluster, using the HDFS or webHDFS protocol for the source cluster and webHDFS for the destination. See Copying Data between a Secure and an Insecure Cluster using DistCp and webHDFS.
  - Run commands on the destination cluster, using webHDFS for both the source and the destination cluster. See Copying Data between a Secure and an Insecure Cluster using DistCp and webHDFS.
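The three options above differ only in where the command runs and which URI scheme is used on each side. The following is a minimal sketch that assembles the corresponding `hadoop distcp` argument list. The helper name `distcp_args`, the option labels, the host names, the path, and the ports (50070 for the NameNode HTTP port, 8020 for the NameNode RPC port) are assumptions for illustration, not values taken from this guide; substitute the values for your own clusters.

```python
# Scheme pairs (source, destination) for the three options listed above.
# The option labels are hypothetical names chosen for this sketch.
SCHEMES = {
    "pull_hftp": ("hftp", "hdfs"),           # run on the destination cluster
    "push_webhdfs": ("hdfs", "webhdfs"),     # run on the source cluster
    "pull_webhdfs": ("webhdfs", "webhdfs"),  # run on the destination cluster
}


def distcp_args(option, src_authority, dst_authority, path):
    """Build the argument list for a `hadoop distcp` invocation.

    src_authority and dst_authority are "host:port" strings (assumed values).
    The same path is used on both sides to keep the example short.
    """
    src_scheme, dst_scheme = SCHEMES[option]
    return [
        "hadoop", "distcp",
        f"{src_scheme}://{src_authority}{path}",
        f"{dst_scheme}://{dst_authority}{path}",
    ]


if __name__ == "__main__":
    # Pull from a CDH 4 source over Hftp; run this on the CDH 5 destination.
    args = distcp_args(
        "pull_hftp",
        "cdh4-nn.example.com:50070",  # assumed source NameNode HTTP address
        "cdh5-nn.example.com:8020",   # assumed destination NameNode RPC address
        "/user/data",                 # assumed path
    )
    print(" ".join(args))
```

The resulting list can be passed to `subprocess.run` on the cluster where the option says the command must run, or printed and pasted into a shell.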
The following restrictions currently apply (see Apache Hadoop Known Issues):
- DistCp does not work between a secure cluster and an insecure cluster in some cases.
  As of CDH 5.1.3, DistCp does work between a secure and an insecure cluster if you use the webHDFS protocol and run the command from the secure cluster side after setting ipc.client.fallback-to-simple-auth-allowed to true, as described under Copying Data between a Secure and an Insecure Cluster using DistCp and webHDFS.
- To use DistCp with Hftp from a secure cluster using SPNEGO, you must configure the dfs.https.port property on the client to use the HTTP port (50070 by default).
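Both restrictions above come down to setting a client-side property before the copy runs. The sketch below assumes the properties are supplied per job through Hadoop's generic `-D property=value` option instead of being set in the client configuration files; whether that is appropriate for your deployment is an assumption to verify against the linked procedures. The helper name `distcp_with_properties`, the host names, and the paths are hypothetical.

```python
def distcp_with_properties(properties, src_uri, dst_uri):
    """Build a `hadoop distcp` argument list with -D property overrides.

    properties maps configuration names to values; each entry becomes
    a "-D name=value" pair placed before the source and destination URIs.
    """
    args = ["hadoop", "distcp"]
    for name, value in properties.items():
        args += ["-D", f"{name}={value}"]
    return args + [src_uri, dst_uri]


if __name__ == "__main__":
    # Secure <-> insecure copy over webHDFS; run from the secure cluster side.
    print(" ".join(distcp_with_properties(
        {"ipc.client.fallback-to-simple-auth-allowed": "true"},
        "webhdfs://insecure-nn.example.com:50070/user/data",  # assumed host/path
        "webhdfs://secure-nn.example.com:50070/user/data",    # assumed host/path
    )))

    # Hftp from a secure cluster using SPNEGO: point dfs.https.port at the HTTP port.
    print(" ".join(distcp_with_properties(
        {"dfs.https.port": "50070"},
        "hftp://secure-nn.example.com:50070/user/data",       # assumed host/path
        "hdfs://dest-nn.example.com:8020/user/data",          # assumed host/path
    )))
```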