Using DistCp with Highly Available remote clusters
You can use distcp to copy files between highly available clusters
by configuring access to the remote cluster with the nameservice ID.
-
Create a new directory and copy the contents of the
/etc/hadoop/confdirectory on the local cluster to this directory.The local cluster is the cluster where you plan to run thedistcpcommand.Specify the new directory for the--configparameter when you run thedistcpcommand in step 5. -
In the
hdfs-site.xmlfile in thedistcpConfdirectory, add the nameservice ID for the remote cluster to thedfs.nameservicesproperty. -
On the remote cluster, find the
hdfs-site.xmlfile and copy the properties that refer to the nameservice ID to the end of thehdfs-site.xmlfile in thedistcpConfdirectory you created in step 1.dfs.ha.namenodes.<nameserviceID>dfs.client.failover.proxy.provider.<remote nameserviceID>dfs.ha.automatic-failover.enabled.<remote nameserviceID>dfs.namenode.rpc-address.<nameserviceID>.<namenode1>dfs.namenode.servicerpc-address.<nameserviceID>.<namenode1>dfs.namenode.http-address.<nameserviceID>.<namenode1>dfs.namenode.https-address.<nameserviceID>.<namenode1>dfs.namenode.rpc-address.<nameserviceID>.<namenode2>dfs.namenode.servicerpc-address.<nameserviceID>.<namenode2>dfs.namenode.http-address.<nameserviceID>.<namenode2>dfs.namenode.https-address.<nameserviceID>.<namenode2>
By default, you can find the
hdfs-site.xmlfile in the/etc/hadoop/confdirectory on a node of the remote cluster. -
If you changed the nameservice ID for the remote cluster in step 2, update the
nameservice ID used in the properties you copied in step 3 with the new
nameservice ID, accordingly.
The following example shows the properties copied from the remote cluster with the following values:
- A remote nameservice called
externalnameservice - NameNodes called
namenode1andnamenode2 - A host named
remotecluster.com
<property> <name>dfs.ha.namenodes.externalnameservice</name> <value>namenode1,namenode2</value> </property> <property> <name>dfs.client.failover.proxy.provider.externalnameservice</name> <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value> </property> <property> <name>dfs.ha.automatic-failover.enabled.externalnameservice</name> <value>true</value> </property> <property> <name>dfs.namenode.rpc-address.externalnameservice.namenode1</name> <value>remotecluster.com:8020</value> </property> <property> <name>dfs.namenode.servicerpc-address.externalnameservice.namenode1</name> <value>remotecluster.com:8022</value> </property> <property> <name>dfs.namenode.http-address.externalnameservice.namenode1</name> <value>remotecluster.com:20101</value> </property> <property> <name>dfs.namenode.https-address.externalnameservice.namenode1</name> <value>remotecluster.com:20102</value> </property> <property> <name>dfs.namenode.rpc-address.externalnameservice.namenode2</name> <value>remotecluster.com:8020</value> </property> <property> <name>dfs.namenode.servicerpc-address.externalnameservice.namenode2</name> <value>remotecluster.com:8022</value> </property> <property> <name>dfs.namenode.http-address.externalnameservice.namenode2</name> <value>remotecluster.com:20101</value> </property> <property> <name>dfs.namenode.https-address.externalnameservice.namenode2</name> <value>remotecluster.com:20102</value> </property>At this point, the
hdfs-site.xmlfile in thedistcpConfdirectory should have both the clusters and four NameNode IDs. - A remote nameservice called
-
Depending on the use case, the options specified when you run the
distcpmay differ.To copy data from an insecure cluster, run the following command:To copy data from a secure cluster, run the following command:hadoop --config distcpConf distcp hdfs://<nameservice>/<source_directory> <target directory>
For example:hadoop --config distcpConf distcp -Dmapreduce.job.hdfs-servers.token-renewal.exclude=<nameservice> hdfs://<nameservice>/<source_directory> <target directory>hadoop --config distcpConf distcp -Dmapreduce.job.hdfs-servers.token-renewal.exclude=ns1 hdfs://ns1/xyz /tmp/testIf the
distcpsource or target are in encryption zones, include the followingdistcpoptions:-skipcrccheck -update. Thedistcpcommand may fail if you do not include these options when the source or target are in encryption zones because the CRC for the files may differ.For
distcpbetween clusters that both use HDFS Transparent Encryption, you must include the exclude parameter.
