Using DistCp between HA clusters using Cloudera Manager

To copy data between HA clusters using distcp, you must configure specific name service properties to ensure that the HDFS clients in a cluster can access a remote HA cluster. You can configure these values through advanced configuration snippets in Cloudera Manager.

This procedure explains how you can configure the name service properties from Cloudera Manager to enable copying of data between two example clusters A and B. Here, A is the source cluster while B is the remote cluster.

  1. Select Clusters and choose the source HDFS cluster where you want to configure the properties.
  2. Navigate to the Configuration tab and search for HDFS Client Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml.
  3. Add the various properties as specified:
    1. Add the dfs.internal.nameservices property as the name service of the source cluster.
      dfs.internal.nameservices = HAA
    2. Add the dfs.nameservices property as the name services of both the clusters.
      dfs.nameservices = HAA,HAB
    3. Add the dfs.client.failover.proxy.provider.<nameservice> property for the remote cluster.
      dfs.client.failover.proxy.provider.HAB = org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
    4. Add the dfs.ha.namenodes.<nameservice> property to reference the NameNodes of the remote cluster.
      dfs.ha.namenodes.HAB = nn1,nn2
    5. Add the dfs.namenode.rpc-address and the dfs.namenode.servicerpc-address properties as the fully qualified RPC addresses on which each NameNode of the remote cluster can listen.
      dfs.namenode.rpc-address.HAB.nn1 = <NN1_fqdn>:8020

      dfs.namenode.rpc-address.HAB.nn2 = <NN2_fqdn>:8020

      dfs.namenode.servicerpc-address.HAB.nn1 = <NN1_fqdn>:8022

      dfs.namenode.servicerpc-address.HAB.nn2 = <NN2_fqdn>:8022

    6. Add the dfs.namenode.http-address and the dfs.namenode.https-address properties as the fully qualified HTTP and HTTPS addresses on which each NameNode of the remote cluster can listen.
      dfs.namenode.http-address.HAB.nn1 = <NN1_fqdn>:9870

      dfs.namenode.http-address.HAB.nn2 = <NN2_fqdn>:9870

      dfs.namenode.https-address.HAB.nn1 = <NN1_fqdn>:9871

      dfs.namenode.https-address.HAB.nn2 = <NN2_fqdn>:9871

  4. Specify the Reason for Change, and click Save Changes.
    This saves the property values as client-side configurations.
  5. Restart the HDFS service, then run the distcp command using the NameService. For example:
    hadoop distcp hdfs://HAA/tmp/testDistcp hdfs://HAB/tmp/