Administering HDFS

DistCp between HA clusters

To copy data between HA clusters, set the dfs.internal.nameservices property in the hdfs-site.xml file to the name services that belong to the local cluster, and continue to list every name service, local and remote, in the dfs.nameservices property. With this split, DataNodes report only to the local name services, while clients such as DistCp can still resolve the remote ones.
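As a minimal sketch, assuming a local name service called HAA and a remote one called HAB (the names used in the steps in this section), the two properties might look like this in the hdfs-site.xml file of cluster A:

```xml
<!-- Sketch for cluster A only; HAA and HAB are example name service IDs. -->
<configuration>
  <!-- Every name service this cluster needs to resolve, local and remote. -->
  <property>
    <name>dfs.nameservices</name>
    <value>HAA,HAB</value>
  </property>
  <!-- Only the name services that belong to this cluster. -->
  <property>
    <name>dfs.internal.nameservices</name>
    <value>HAA</value>
  </property>
</configuration>
```

On cluster B the dfs.nameservices value would stay the same, and dfs.internal.nameservices would be HAB instead.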

Use the following steps to copy data between HA clusters:

Modify the following properties in the hdfs-site.xml file for both cluster A and cluster B:

  1. Add both name services to the dfs.nameservices property:

      dfs.nameservices = HAA,HAB

  2. Add the dfs.internal.nameservices property:
    • In cluster A:

      dfs.internal.nameservices = HAA

    • In cluster B:

      dfs.internal.nameservices = HAB

  3. Add the dfs.ha.namenodes.<nameservice> property for the remote name service to both clusters:
    • In cluster A:

      dfs.ha.namenodes.HAB = nn1,nn2

    • In cluster B:

      dfs.ha.namenodes.HAA = nn1,nn2

  4. Add the dfs.namenode.rpc-address.<nameservice>.<nn> property, where <NN1_fqdn> and <NN2_fqdn> are the NameNode hosts of the remote cluster:
    • In cluster A:

      dfs.namenode.rpc-address.HAB.nn1 = <NN1_fqdn>:8020

      dfs.namenode.rpc-address.HAB.nn2 = <NN2_fqdn>:8020

    • In cluster B:

      dfs.namenode.rpc-address.HAA.nn1 = <NN1_fqdn>:8020

      dfs.namenode.rpc-address.HAA.nn2 = <NN2_fqdn>:8020

  5. Add the following properties to enable distcp over WebHDFS and secure WebHDFS:
    • In cluster A:

      dfs.namenode.http-address.HAB.nn1 = <NN1_fqdn>:50070

      dfs.namenode.http-address.HAB.nn2 = <NN2_fqdn>:50070

      dfs.namenode.https-address.HAB.nn1 = <NN1_fqdn>:50470

      dfs.namenode.https-address.HAB.nn2 = <NN2_fqdn>:50470

    • In cluster B:

      dfs.namenode.http-address.HAA.nn1 = <NN1_fqdn>:50070

      dfs.namenode.http-address.HAA.nn2 = <NN2_fqdn>:50070

      dfs.namenode.https-address.HAA.nn1 = <NN1_fqdn>:50470

      dfs.namenode.https-address.HAA.nn2 = <NN2_fqdn>:50470

  6. Add the dfs.client.failover.proxy.provider.<nameservice> property:
    • In cluster A:

      dfs.client.failover.proxy.provider.HAB = org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider

    • In cluster B:

      dfs.client.failover.proxy.provider.HAA = org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider

  7. Restart the HDFS service on both clusters, then run the distcp command using the name services as the source and destination. For example:
    hadoop distcp hdfs://HAA/tmp/testDistcp hdfs://HAB/tmp/
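
Steps 3, 4, and 6 above can be sketched as a single hdfs-site.xml fragment for cluster A. This is an illustration under stated assumptions, not a complete configuration: the host names nn1.clusterb.example.com and nn2.clusterb.example.com are hypothetical placeholders for the NameNodes of cluster B, and the WebHDFS addresses from step 5 would be added with the same property pattern.

```xml
<!-- Sketch: remote name service HAB as seen from cluster A. -->
<configuration>
  <!-- Step 3: NameNode IDs of the remote name service. -->
  <property>
    <name>dfs.ha.namenodes.HAB</name>
    <value>nn1,nn2</value>
  </property>
  <!-- Step 4: RPC addresses of the remote NameNodes (hypothetical hosts). -->
  <property>
    <name>dfs.namenode.rpc-address.HAB.nn1</name>
    <value>nn1.clusterb.example.com:8020</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.HAB.nn2</name>
    <value>nn2.clusterb.example.com:8020</value>
  </property>
  <!-- Step 6: failover proxy provider used by clients to find the active NameNode of HAB. -->
  <property>
    <name>dfs.client.failover.proxy.provider.HAB</name>
    <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
  </property>
</configuration>
```

The fragment for cluster B would mirror this one, with HAA in place of HAB and the NameNode hosts of cluster A.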