DistCp between HA clusters

To copy data between HA clusters, set two properties in the hdfs-site.xml file of each cluster. The dfs.nameservices property lists all of the name services in both the local and remote clusters, and the dfs.internal.nameservices property explicitly lists only the name services that belong to the local cluster.
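
As a rough illustration, the two properties might look like this in the hdfs-site.xml of cluster A, assuming the example name service IDs HAA (local) and HAB (remote). This is a sketch of the property layout only, not a complete configuration.

```xml
<!-- Cluster A (sketch): HAA is assumed to be the local name service, HAB the remote one -->
<property>
  <!-- Every name service this client should be able to resolve -->
  <name>dfs.nameservices</name>
  <value>HAA,HAB</value>
</property>
<property>
  <!-- Only the name services that belong to this (local) cluster -->
  <name>dfs.internal.nameservices</name>
  <value>HAA</value>
</property>
```

On cluster B the first value would stay the same and the second would become HAB.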

Use the following steps to copy data between HA clusters:

Edit the HDFS Client Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml for both cluster A and cluster B:

  1. Open the Cloudera Manager Admin Console.
  2. Go to the HDFS service.
  3. Click the Configuration tab.
  4. Select Scope > Gateway.
  5. Select Category > Advanced.
  6. Search for HDFS Client Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml, and add the following properties:
    1. Add both name services to dfs.nameservices:

        dfs.nameservices = HAA,HAB

    2. Add the dfs.internal.nameservices property:
      • In cluster A:

        dfs.internal.nameservices = HAA

      • In cluster B:

        dfs.internal.nameservices = HAB

    3. Add the dfs.ha.namenodes.<nameservice> property for the remote name service on each cluster:
      • In cluster A:

        dfs.ha.namenodes.HAB = nn1,nn2

      • In cluster B:

        dfs.ha.namenodes.HAA = nn1,nn2

    4. Add the dfs.namenode.rpc-address.<nameservice>.<nn> property for each NameNode of the remote cluster:
      • In cluster A:

        dfs.namenode.rpc-address.HAB.nn1 = <NN1_fqdn>:8020

        dfs.namenode.rpc-address.HAB.nn2 = <NN2_fqdn>:8020

      • In cluster B:

        dfs.namenode.rpc-address.HAA.nn1 = <NN1_fqdn>:8020

        dfs.namenode.rpc-address.HAA.nn2 = <NN2_fqdn>:8020

    5. Add the following properties to enable DistCp over WebHDFS and secure WebHDFS:
      • In cluster A:

        dfs.namenode.http-address.HAB.nn1 = <NN1_fqdn>:50070

        dfs.namenode.http-address.HAB.nn2 = <NN2_fqdn>:50070

        dfs.namenode.https-address.HAB.nn1 = <NN1_fqdn>:50470

        dfs.namenode.https-address.HAB.nn2 = <NN2_fqdn>:50470

      • In cluster B:

        dfs.namenode.http-address.HAA.nn1 = <NN1_fqdn>:50070

        dfs.namenode.http-address.HAA.nn2 = <NN2_fqdn>:50070

        dfs.namenode.https-address.HAA.nn1 = <NN1_fqdn>:50470

        dfs.namenode.https-address.HAA.nn2 = <NN2_fqdn>:50470

    6. Add the dfs.client.failover.proxy.provider.<nameservice> property for the remote name service:
      • In cluster A:

        dfs.client.failover.proxy.provider.HAB = org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider

      • In cluster B:

        dfs.client.failover.proxy.provider.HAA = org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider

  7. Restart the HDFS service, then run the distcp command, referring to each cluster by its name service. For example:

      hadoop distcp hdfs://HAA/tmp/testDistcp hdfs://HAB/tmp/
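
Put together, the entries for cluster A could look roughly like the following hdfs-site.xml fragment. This is a sketch under stated assumptions: the host names nn1.cluster-b.example.com and nn2.cluster-b.example.com are hypothetical placeholders for the FQDNs of cluster B's NameNodes, and the WebHDFS http-address and https-address properties are left out for brevity.

```xml
<!-- Cluster A (sketch): entries describing the local and remote name services -->
<property>
  <name>dfs.nameservices</name>
  <value>HAA,HAB</value>
</property>
<property>
  <name>dfs.internal.nameservices</name>
  <value>HAA</value>
</property>
<property>
  <!-- Logical NameNode IDs of the remote name service HAB -->
  <name>dfs.ha.namenodes.HAB</name>
  <value>nn1,nn2</value>
</property>
<property>
  <!-- Hypothetical host name: replace with the FQDN of cluster B's first NameNode -->
  <name>dfs.namenode.rpc-address.HAB.nn1</name>
  <value>nn1.cluster-b.example.com:8020</value>
</property>
<property>
  <!-- Hypothetical host name: replace with the FQDN of cluster B's second NameNode -->
  <name>dfs.namenode.rpc-address.HAB.nn2</name>
  <value>nn2.cluster-b.example.com:8020</value>
</property>
<property>
  <!-- Lets the client fail over between nn1 and nn2 of HAB -->
  <name>dfs.client.failover.proxy.provider.HAB</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
```

Cluster B would mirror this fragment with HAA and HAB swapped and the host names pointing at cluster A's NameNodes.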