Using DistCp with Highly Available remote clusters
You can use distcp to copy files between highly available clusters by configuring access to the remote cluster with the nameservice ID.
-
Create a new directory (called distcpConf in this procedure) and copy the contents of the /etc/hadoop/conf directory on the local cluster to this directory. The local cluster is the cluster where you plan to run the distcp command. Specify the new directory for the --config parameter when you run the distcp command in step 5.
-
In the hdfs-site.xml file in the distcpConf directory, add the nameservice ID for the remote cluster to the dfs.nameservices property.
-
On the remote cluster, find the hdfs-site.xml file and copy the properties that refer to the nameservice ID to the end of the hdfs-site.xml file in the distcpConf directory you created in step 1:
dfs.ha.namenodes.<nameserviceID>
dfs.client.failover.proxy.provider.<nameserviceID>
dfs.ha.automatic-failover.enabled.<nameserviceID>
dfs.namenode.rpc-address.<nameserviceID>.<namenode1>
dfs.namenode.servicerpc-address.<nameserviceID>.<namenode1>
dfs.namenode.http-address.<nameserviceID>.<namenode1>
dfs.namenode.https-address.<nameserviceID>.<namenode1>
dfs.namenode.rpc-address.<nameserviceID>.<namenode2>
dfs.namenode.servicerpc-address.<nameserviceID>.<namenode2>
dfs.namenode.http-address.<nameserviceID>.<namenode2>
dfs.namenode.https-address.<nameserviceID>.<namenode2>
By default, you can find the hdfs-site.xml file in the /etc/hadoop/conf directory on a node of the remote cluster.
-
If you changed the nameservice ID for the remote cluster in step 2, update the properties you copied in step 3 to use the new nameservice ID.
The following example shows the properties copied from the remote cluster, using these values:
- A remote nameservice called externalnameservice
- NameNodes called namenode1 and namenode2
- A host named remotecluster.com
<property>
  <name>dfs.ha.namenodes.externalnameservice</name>
  <value>namenode1,namenode2</value>
</property>
<property>
  <name>dfs.client.failover.proxy.provider.externalnameservice</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<property>
  <name>dfs.ha.automatic-failover.enabled.externalnameservice</name>
  <value>true</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.externalnameservice.namenode1</name>
  <value>remotecluster.com:8020</value>
</property>
<property>
  <name>dfs.namenode.servicerpc-address.externalnameservice.namenode1</name>
  <value>remotecluster.com:8022</value>
</property>
<property>
  <name>dfs.namenode.http-address.externalnameservice.namenode1</name>
  <value>remotecluster.com:20101</value>
</property>
<property>
  <name>dfs.namenode.https-address.externalnameservice.namenode1</name>
  <value>remotecluster.com:20102</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.externalnameservice.namenode2</name>
  <value>remotecluster.com:8020</value>
</property>
<property>
  <name>dfs.namenode.servicerpc-address.externalnameservice.namenode2</name>
  <value>remotecluster.com:8022</value>
</property>
<property>
  <name>dfs.namenode.http-address.externalnameservice.namenode2</name>
  <value>remotecluster.com:20101</value>
</property>
<property>
  <name>dfs.namenode.https-address.externalnameservice.namenode2</name>
  <value>remotecluster.com:20102</value>
</property>
At this point, the hdfs-site.xml file in the distcpConf directory should contain entries for both clusters and four NameNode IDs.
-
Depending on the use case, the options you specify when you run the distcp command may differ.
To copy data from an insecure cluster, run the following command:
hadoop --config distcpConf distcp hdfs://<nameservice>/<source_directory> <target_directory>
To copy data from a secure cluster, run the following command:
hadoop --config distcpConf distcp -Dmapreduce.job.hdfs-servers.token-renewal.exclude=<nameservice> hdfs://<nameservice>/<source_directory> <target_directory>
For example:
hadoop --config distcpConf distcp -Dmapreduce.job.hdfs-servers.token-renewal.exclude=ns1 hdfs://ns1/xyz /tmp/test
If the distcp source or target is in an encryption zone, include the following distcp options: -skipcrccheck -update. Without these options, the distcp command may fail because the CRC for the files may differ between the source and the target.
For distcp between clusters that both use HDFS Transparent Encryption, you must include the exclude parameter (mapreduce.job.hdfs-servers.token-renewal.exclude).
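The mechanical parts of steps 1 and 3 above can be sketched as two small shell helpers. This is only a sketch under stated assumptions: the function names prepare_distcp_conf and list_nameservice_props are hypothetical and not part of Hadoop, the default paths are the ones used in this procedure, and the nameservice ID is assumed to contain no regular-expression metacharacters. Editing dfs.nameservices and pasting the copied properties (steps 2 to 4) is still done by hand.

```shell
# Hypothetical helpers for this procedure; not part of the Hadoop distribution.

# Step 1: copy a client configuration directory into a new directory.
#   $1 = source configuration directory (default: /etc/hadoop/conf)
#   $2 = new directory to pass to --config (default: distcpConf)
prepare_distcp_conf() {
  local src="${1:-/etc/hadoop/conf}"
  local dest="${2:-distcpConf}"
  mkdir -p "$dest" && cp -r "$src"/. "$dest"/
}

# Step 3: print the names of the properties in an hdfs-site.xml file
# that refer to a given nameservice ID, one per line.
#   $1 = path to hdfs-site.xml
#   $2 = nameservice ID (assumed to be a plain word)
list_nameservice_props() {
  local file="$1"
  local ns="$2"
  grep -o '<name>[^<]*</name>' "$file" \
    | sed -e 's/<name>//' -e 's/<\/name>//' \
    | grep -E "\.${ns}(\.|\$)"
}
```

For example, after loading these functions on the local cluster, running prepare_distcp_conf with no arguments creates distcpConf from /etc/hadoop/conf. Running list_nameservice_props against a copy of the remote hdfs-site.xml with the remote nameservice ID prints the property names to look for when copying. The helpers only list names; they do not modify any configuration file.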