Copying Data Between Two Clusters Using Distcp

The Distcp Command

The distributed copy command, distcp, is a general utility for copying large data sets between distributed filesystems within and across clusters. The distcp command submits a regular MapReduce job that performs a file-by-file copy.

To see the distcp command options, run the built-in help:
$ hadoop distcp

Distcp Syntax and Examples

You can use distcp to copy files between compatible clusters in either direction, from or to the source or destination clusters.

For example, when upgrading, say from a CDH 5.7 cluster to a CDH 5.9, you should run distcp from the CDH 5.13 cluster in this manner:

$ hadoop distcp hftp://cdh57-namenode:50070/ hdfs://CDH59-nameservice/
$ hadoop distcp s3a://bucket/ hdfs://CDH59-nameservice/

You can also use a specific path, such as /hbase to move HBase data, for example:

$ hadoop distcp hftp://cdh57-namenode:50070/hbase hdfs://CDH59-nameservice/hbase
$ hadoop distcp s3a://bucket/file hdfs://CDH59-nameservice/bucket/file

HFTP Protocol

The HFTP protocol allows you to use FTP resources in an HTTP request. When copying with distcp across different versions of CDH, use hftp:// for the source file system and hdfs:// for the destination file system, and run distcp from the destination cluster. The default port for HFTP is 50070 and the default port for HDFS is 8020.

Example of a source URI: hftp://namenode-location:50070/basePath

  • hftp:// is the source protocol.
  • namenode-location is the CDH 4 (source) NameNode hostname as defined by its configured fs.default.name.
  • 50070 is the NameNode's HTTP server port, as defined by the configured dfs.http.address.

Example of a destination URI: hdfs://nameservice-id/basePath or hdfs://namenode-location

  • hdfs:// is the destination protocol
  • nameservice-id or namenode-location is the CDH 5 (destination) NameNode hostname as defined by its configured fs.defaultFS.
  • basePath in both examples refers to the directory you want to copy, if one is specifically needed.

S3 Protocol

Amazon S3 block and native filesystems are also supported with the s3a:// protocol.

Example of an Amazon S3 Block Filesystem URI: s3a://bucket_name/path/to/file

S3 credentials can be provided in a configuration file (for example, core-site.xml):
<property>
    <name>fs.s3a.access.key</name>
    <value>...</value>
</property>
<property>
    <name>fs.s3a.secret.key</name>
    <value>...</value>
</property>
or run on the command line as follows:
hadoop distcp -Dfs.s3a.access.key=... -Dfs.s3a.secret.key=... s3a://

Enabling Fallback Configuration

To enable the fallback configuration, for copying between secure and insecure clusters, add the following to the HDFS configuration file, core-default.xml, by using an advanced configuration snippet if you use Cloudera Manager, or editing the file directly otherwise.
<property>
  <name>ipc.client.fallback-to-simple-auth-allowed</name>
  <value>true</value>
</property>

Protocol Support for Distcp

The following table lists the protocols supported with the distcp command on different versions of CDH. "Secure" means that the cluster is configured to use Kerberos.

Source Destination Where to Issue distcp Command Source Protocol Source Config Destination Protocol Destination Config Fallback Config Required
CDH 4 CDH 4 Destination hftp Secure hdfs or webhdfs Secure  
CDH 4 CDH 4 Source or Destination hdfs or webhdfs Secure hdfs or webhdfs Secure  
CDH 4 CDH 4 Source or Destination hdfs or webhdfs Insecure hdfs or webhdfs Insecure  
CDH 4 CDH 4 Destination hftp Insecure hdfs or webhdfs Insecure  
               
CDH 4 CDH 5 Destination webhdfs or hftp Secure webhdfs or hdfs Secure  
CDH 4 CDH 5.1.3+ Destination webhdfs Insecure webhdfs Secure Yes
CDH 4 CDH 5 Destination webhdfs or hftp Insecure webhdfs or hdfs Insecure  
CDH 4 CDH 5 Source hdfs or webhdfs Insecure webhdfs Insecure  
               
CDH 5 CDH 4 Source or Destination webhdfs Secure webhdfs Secure  
CDH 5 CDH 4 Source hdfs Secure webhdfs Secure  
CDH 5.1.3+ CDH 4 Source hdfs or webhdfs Secure webhdfs Insecure Yes
CDH 5 CDH 4 Source or Destination webhdfs Insecure webhdfs Insecure  
CDH 5 CDH 4 Destination webhdfs Insecure hdfs Insecure  
CDH 5 CDH 4 Source hdfs Insecure webhdfs Insecure  
CDH 5 CDH 4 Destination hftp Insecure hdfs or webhdfs Insecure  
               
CDH 5 CDH 5 Source or Destination hdfs or webhdfs Secure hdfs or webhdfs Secure  
CDH 5 CDH 5 Destination hftp Secure hdfs or webhdfs Secure  
CDH 5.1.3+ CDH 5 Source hdfs or webhdfs Secure hdfs or webhdfs Insecure Yes
CDH 5 CDH 5.1.3+ Destination hdfs or webhdfs Insecure hdfs or webhdfs Secure Yes
CDH 5 CDH 5 Source or Destination hdfs or webhdfs Insecure hdfs or webhdfs Insecure  
CDH 5 CDH 5 Destination hftp Insecure hdfs or webhdfs Insecure  

Distcp between Secure Clusters in Distinct Kerberos Realms

Specify the Destination Parameters in krb5.conf

Edit the krb5.conf file on the client (where the distcp job will be submitted) to include the destination hostname and realm.
[realms]
HADOOP.QA.domain.COM = { kdc = kdc.domain.com:88 admin_server = admin.test.com:749
default_domain = domain.com supported_enctypes = arcfour-hmac:normal des-cbc-crc:normal
des-cbc-md5:normal des:normal des:v4 des:norealm des:onlyrealm des:afs3 } 

[domain_realm]
.domain.com = HADOOP.test.domain.COM
domain.com = HADOOP.test.domain.COM
test03.domain.com = HADOOP.QA.domain.COM

Configure HDFS RPC Protection and Acceptable Kerberos Principal Patterns

Set the hadoop.rpc.protection property to authentication in both clusters. You can modify this property either in hdfs-site.xml, or using Cloudera Manager as follows:
  1. Open the Cloudera Manager Admin Console.
  2. Go to the HDFS service.
  3. Click the Configuration tab.
  4. Select Scope > HDFS-1 (Service-Wide)
  5. Select Category > Security.
  6. Locate the Hadoop RPC Protection property and select authentication.
  7. Click Save Changes to commit the changes.

The following steps are not required if the two realms are already set up to trust each other, or have the same principal pattern. However, this isn't usually the case.

Set the dfs.namenode.kerberos.principal.pattern property to * to allow distcp irrespective of the principal patterns of the source and destination clusters. You can modify this property either in hdfs-site.xml on both clusters, or using Cloudera Manager as follows:
  1. Open the Cloudera Manager Admin Console.
  2. Go to the HDFS service.
  3. Click the Configuration tab.
  4. Select Scope > Gateway
  5. Select Category > Advanced.
  6. Edit the HDFS Client Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml property to add:
    <property>
      <name>dfs.namenode.kerberos.principal.pattern</name>
      <value>*</value>
    </property>
  7. Click Save Changes to commit the changes.

(If TLS/SSL is enabled) Specify Truststore Properties

The following properties must be configured in the ssl-client.xml file on the client submitting the distcp job to establish trust between the target and destination clusters.
<property>
<name>ssl.client.truststore.location</name>
<value>path_to_truststore</value>
</property>

<property>
<name>ssl.client.truststore.password</name>
<value>XXXXXX</value>
</property>

<property>
<name>ssl.client.truststore.type</name>
<value>jks</value>
</property>

Set HADOOP_CONF to the Destination Cluster

Set the HADOOP_CONF path to be the destination environment. If you are not using HFTP, set the HADOOP_CONF path to the source environment instead.

Launch Distcp

Kinit on the client and launch the distcp job.
hadoop distcp hdfs://test01.domain.com:8020/user/alice hdfs://test02.domain.com:8020/user/alice
If launching distcp fails, force Kerberos to use TCP instead of UDP by adding the following parameter to the krb5.conf file on the client.
[libdefaults]
udp_preference_limit = 1