Using DistCp with Highly Available remote clusters

You can use distcp to copy files between highly available clusters by configuring access to the remote cluster with the nameservice ID.

  1. Create a new directory, for example distcpConf, and copy the contents of the /etc/hadoop/conf directory on the local cluster into it.
    The local cluster is the cluster where you plan to run the distcp command.
    Specify this directory for the --config parameter when you run the distcp command in step 5.
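    For example, a minimal sketch of this step, assuming the new directory is named distcpConf and is created in your working directory:

    mkdir distcpConf
    cp -r /etc/hadoop/conf/* distcpConf/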
  2. In the hdfs-site.xml file in the distcpConf directory, add the nameservice ID for the remote cluster to the comma-separated list in the dfs.nameservices property.
    If the remote cluster uses the same nameservice ID as the local cluster, assign a new, unique ID to the remote cluster in this property.
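    For example, a minimal sketch of the updated property, assuming a local nameservice named localnameservice (a hypothetical name) and a remote nameservice named externalnameservice:

    <property>
      <name>dfs.nameservices</name>
      <value>localnameservice,externalnameservice</value>
    </property>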
  3. On the remote cluster, find the hdfs-site.xml file and copy the properties that refer to the nameservice ID to the end of the hdfs-site.xml file in the distcpConf directory you created in step 1.
    • dfs.ha.namenodes.<nameserviceID>
    • dfs.client.failover.proxy.provider.<nameserviceID>
    • dfs.ha.automatic-failover.enabled.<nameserviceID>
    • dfs.namenode.rpc-address.<nameserviceID>.<namenode1>
    • dfs.namenode.servicerpc-address.<nameserviceID>.<namenode1>
    • dfs.namenode.http-address.<nameserviceID>.<namenode1>
    • dfs.namenode.https-address.<nameserviceID>.<namenode1>
    • dfs.namenode.rpc-address.<nameserviceID>.<namenode2>
    • dfs.namenode.servicerpc-address.<nameserviceID>.<namenode2>
    • dfs.namenode.http-address.<nameserviceID>.<namenode2>
    • dfs.namenode.https-address.<nameserviceID>.<namenode2>

    By default, you can find the hdfs-site.xml file in the /etc/hadoop/conf directory on a node of the remote cluster.
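
    For example, a sketch of one way to retrieve and inspect the remote file, assuming SSH access to a remote node at remotecluster.com (the hostname used in the example in step 4):

    # Copy the remote cluster's hdfs-site.xml to the local node
    scp remotecluster.com:/etc/hadoop/conf/hdfs-site.xml /tmp/remote-hdfs-site.xml
    # List the properties that mention the remote nameservice ID
    grep externalnameservice /tmp/remote-hdfs-site.xml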

  4. If you changed the nameservice ID for the remote cluster in step 2, update the nameservice ID in the properties you copied in step 3 to match the new ID.

    The following example shows the properties copied from the remote cluster, using these values:

    • A remote nameservice called externalnameservice
    • NameNodes called namenode1 and namenode2
    • A host named remotecluster.com
    <property>
      <name>dfs.ha.namenodes.externalnameservice</name>
      <value>namenode1,namenode2</value>
    </property>

    <property>
      <name>dfs.client.failover.proxy.provider.externalnameservice</name>
      <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
    </property>

    <property>
      <name>dfs.ha.automatic-failover.enabled.externalnameservice</name>
      <value>true</value>
    </property>

    <property>
      <name>dfs.namenode.rpc-address.externalnameservice.namenode1</name>
      <value>remotecluster.com:8020</value>
    </property>

    <property>
      <name>dfs.namenode.servicerpc-address.externalnameservice.namenode1</name>
      <value>remotecluster.com:8022</value>
    </property>

    <property>
      <name>dfs.namenode.http-address.externalnameservice.namenode1</name>
      <value>remotecluster.com:20101</value>
    </property>

    <property>
      <name>dfs.namenode.https-address.externalnameservice.namenode1</name>
      <value>remotecluster.com:20102</value>
    </property>

    <property>
      <name>dfs.namenode.rpc-address.externalnameservice.namenode2</name>
      <value>remotecluster.com:8020</value>
    </property>

    <property>
      <name>dfs.namenode.servicerpc-address.externalnameservice.namenode2</name>
      <value>remotecluster.com:8022</value>
    </property>

    <property>
      <name>dfs.namenode.http-address.externalnameservice.namenode2</name>
      <value>remotecluster.com:20101</value>
    </property>

    <property>
      <name>dfs.namenode.https-address.externalnameservice.namenode2</name>
      <value>remotecluster.com:20102</value>
    </property>
    

    At this point, the hdfs-site.xml file in the distcpConf directory should define the nameservice IDs of both clusters and all four NameNode IDs, two for each cluster.
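
    To sanity-check the merged configuration, you can query it with the getconf utility. A minimal sketch, assuming a local nameservice named localnameservice (a hypothetical name):

    hdfs --config distcpConf getconf -confKey dfs.nameservices
    # Expected output: localnameservice,externalnameservice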

  5. Depending on your use case, the options that you specify when you run distcp may differ.
    To copy data from an insecure cluster, run the following command:
    hadoop --config distcpConf distcp hdfs://<nameservice>/<source_directory> <target_directory>
    To copy data from a secure cluster, run the following command:
    hadoop --config distcpConf distcp -Dmapreduce.job.hdfs-servers.token-renewal.exclude=<nameservice> hdfs://<nameservice>/<source_directory> <target_directory>
    For example:
    hadoop --config distcpConf distcp -Dmapreduce.job.hdfs-servers.token-renewal.exclude=ns1 hdfs://ns1/xyz /tmp/test

    If the distcp source or target is in an encryption zone, include the following distcp options: -skipcrccheck -update. Without these options, the distcp command may fail because data in an encryption zone is stored encrypted, so the CRCs for the files may differ between the source and the target.
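
    For example, a sketch that reuses the values from the earlier example and adds the encryption-zone options:

    hadoop --config distcpConf distcp -Dmapreduce.job.hdfs-servers.token-renewal.exclude=ns1 -skipcrccheck -update hdfs://ns1/xyz /tmp/test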

    For distcp between clusters that both use HDFS Transparent Encryption, you must include the -Dmapreduce.job.hdfs-servers.token-renewal.exclude parameter shown in the secure-cluster command above.