Copying Cluster Data Using DistCp

The distributed copy command, distcp, is a general utility for copying large data sets between distributed filesystems within and across clusters. You can also use distcp to copy data to and from an Amazon S3 bucket. The distcp command submits a regular MapReduce job that performs a file-by-file copy.

To see the distcp command options, run the built-in help:

$ hadoop distcp

Continue reading:

DistCp Syntax and Examples
Using DistCp with Amazon S3
Kerberos Setup Guidelines for Distcp between Secure Clusters
Distcp between Secure Clusters in Distinct Kerberos Realms
Enabling Fallback Configuration
Protocol Support for Distcp

DistCp Syntax and Examples

You can use distcp to copy files between compatible CDH clusters. The basic form of the distcp command only requires a source cluster and a destination cluster:

$ hadoop distcp <source> <destination>

Between the Same CDH Version

Use the following syntax for copying data between clusters that run the same CDH version:

hadoop distcp hdfs://<namenode>:<port> hdfs://<namenode>

For example, the following command copies data from the example-source cluster to the example-dest cluster:

hadoop distcp hdfs://example-source.cloudera.com:50070 hdfs://example-dest.cloudera.com

Port 50070 is the default NameNode port for HDFS.

Between Different CDH Versions

You can use distcp to copy data between different CDH versions. When you do this, adhere to the following guidelines:

The CDH versions must be compatible.
The source cluster must run a lower version of CDH than the destination cluster.
Run the distcp command on the cluster that runs the higher version of CDH, which should be the destination cluster.
Use the webhdfs protocol for the remote cluster.

Use the following syntax:

hadoop distcp webhdfs://<namenode>:<port> hdfs://<namenode>

For example, when using distcp between a CDH 5.7.0 cluster and a cluster that runs a higher version of CDH, use the webhdfs protocol for the CDH 5.7.0 cluster and designate it as the source cluster by listing it as the first cluster in the distcp command.

The following command copies data from a lower versioned source cluster named example-source to a higher versioned destination cluster named example-dest:

hadoop distcp webhdfs://example57-source.cloudera.com:50070 hdfs://example512-dest.cloudera.com

For a Specific Path

You can specify a path, such as /hbase, to move HBase data:

hadoop distcp webhdfs://example-source.cloudera.com:50070/hbase hdfs://example-dest.cloudera.com/hbase

To/from Amazon S3

You can copy data to or from Amazon S3 with the following syntax:

#Copying from S3
hadoop distcp s3a://<bucket>/<data> hdfs://<namenode>/<directory>/

#Copying to S3
hadoop distcp hdfs://<namenode>/<directory> s3a://<bucket>/<data>

This is a basic example of using distcp with S3. For more information, see Using DistCp with Amazon S3.

Using DistCp with Amazon S3

You can copy HDFS files to and from an Amazon S3 instance. You must provision an S3 bucket using Amazon Web Services and obtain the access key and secret key. You can pass these credentials on the distcp command line, or you can reference a credential store to "hide" sensitive credentials so that they do not appear in the console output, configuration files, or log files.

Amazon S3 block and native filesystems are supported with the s3a:// protocol.

Example of an Amazon S3 Block Filesystem URI: s3a://bucket_name/path/to/file

S3 credentials can be provided in a configuration file (for example, core-site.xml):

<property>
    <name>fs.s3a.access.key</name>
    <value>...</value>
</property>
<property>
    <name>fs.s3a.secret.key</name>
    <value>...</value>
</property>

You can also enter the configurations in the Advanced Configuration Snippet for core-site.xml, which allows Cloudera Manager to manage this configuration. See Custom Configuration.

You can also provide the credentials on the command line:

hadoop distcp -Dfs.s3a.access.key=... -Dfs.s3a.secret.key=... s3a://

For example:

hadoop distcp -Dfs.s3a.access.key=myAccessKey -Dfs.s3a.secret.key=mySecretKey hdfs://MyNameservice-id/user/hdfs/mydata s3a://myBucket/mydata_backup

Using a Credential Provider to Secure S3 Credentials

You can run the distcp command without having to enter the access key and secret key on the command line. This prevents these credentials from being exposed in console output, log files, configuration files, and other artifacts. Running the command in this way requires that you provision a credential store to securely store the access key and secret key. The credential store file is saved in HDFS.

To provision credentials in a credential store:

Provision the credentials by running the following commands:

hadoop credential create fs.s3a.access.key -value access_key -provider jceks://hdfs/path_to_credential_store_file
hadoop credential create fs.s3a.secret.key -value secret_key -provider jceks://hdfs/path_to_credential_store_file

For example:

hadoop credential create fs.s3a.access.key -value foobar -provider jceks://hdfs/user/alice/home/keystores/aws.jceks
hadoop credential create fs.s3a.secret.key -value barfoo -provider jceks://hdfs/user/alice/home/keystores/aws.jceks

You can omit the -value option and its value and the command will prompt the user to enter the value.

For more details on the hadoop credential command, see Credential Management (Apache Software Foundation).

Copy the contents of the /etc/hadoop/conf directory to a working directory.

Add the following to the core-site.xml file in the working directory:

<property>
<name>hadoop.security.credential.provider.path</name>
<value>jceks://hdfs/path_to_credential_store_file</value>
</property>

Set the HADOOP_CONF_DIR environment variable to the location of the working directory:
```
export HADOOP_CONF_DIR=path_to_working_directory
```

After completing these steps, you can run the distcp command using the following syntax:

hadoop distcp hdfs://nameservice-id/source_path s3a://destination_path

You can also reference the credential store on the command line, without having to enter it in a copy of the core-site.xml file. You also do not need to set a value for HADOOP_CONF_DIR. Use the following syntax:

hadoop distcp hdfs://source_path s3a://bucket_name/destination_path
-Dhadoop.security.credential.provider.path=jceks://hdfspath_to_credential_store_file

There are additional options for the distcp command. See DistCp Guide (Apache Software Foundation).

Examples of DistCP Commands Using the S3 Protocol and Hidden Credentials

Copying files to Amazon S3

hadoop distcp hdfs://user/hdfs/mydata s3a://myBucket/mydata_backup

Copying files from Amazon S3

hadoop distcp s3a://myBucket/mydata_backup hdfs://user/hdfs/mydata

Copying files to Amazon S3 using the -filters option to exclude specified source files

You specify a file name with the -filters option. The referenced file contains regular expressions, one per line, that define file name patterns to exclude from the distcp job. The pattern specified in the regular expression should match the fully-qualified path of the intended files, including the scheme (hdfs, webhdfs, s3a, etc.). For example, the following are valid expressions for excluding files:

hdfs://x.y.z:8020/a/b/c
webhdfs://x.y.z:50070/a/b/c
s3a://bucket/a/b/c

Reference the file containing the filter expressions using -filters option. For example:

hadoop distcp -filters /user/joe/myFilters hdfs://user/hdfs/mydata s3a://myBucket/mydata_backup

Contents of the sample myFilters file:

.*foo.*
.*/bar/.*
hdfs://x.y.z:8020/tmp/.*
hdfs://x.y.z:8020/tmp1/file1

The regular expressions in the myFilters exclude the following files:

.*foo.* – excludes paths that contain the string "foo".
.*/bar/.* – excludes paths that include a directory named bar.
hdfs://x.y.z:8020/tmp/.* – excludes all files in the /tmp directory.
hdfs://x.y.z:8020/tmp1/file1 – excludes the file /tmp1/file1.

Copying files to Amazon S3 with the -overwrite option.

The -overwrite option overwrites destination files that already exist.

hadoop distcp -overwrite hdfs://user/hdfs/mydata  s3a://user/mydata_backup

For more information about the -filters, -overwrite, and other options, see DIstCp Guide: Command Line Options (Apache Software Foundation).

Kerberos Setup Guidelines for Distcp between Secure Clusters

The guidelines mentioned in this section are only applicable for the following sample deployment:

Let's assume you have two clusters with the realms: SOURCE and DESTINATION
You have data that needs to be copied from SOURCE to DESTINATION
Trust exists between SOURCE and Active Directory, and DESTINATION and Active Directory.
Both SOURCE and DESTINATION clusters are running CDH 5.3.4 or higher

If your environment matches the one described above, use the following table to configure Kerberos delegation tokens on your cluster so that you can successfully distcp across two secure clusters. Based on the direction of the trust between the SOURCE and DESTINATION clusters, you can use the mapreduce.job.hdfs-servers.token-renewal.exclude property to instruct ResourceManagers on either cluster to skip or perform delegation token renewal for NameNode hosts.

Environment Type		Kerberos Delegation Token Setting
`SOURCE` trusts `DESTINATION`	Distcp job runs on the `DESTINATION` cluster	You do not need to set the `mapreduce.job.hdfs-servers.token-renewal.exclude` property.
`SOURCE` trusts `DESTINATION`	Distcp job runs on the `SOURCE` cluster	Set the `mapreduce.job.hdfs-servers.token-renewal.exclude` property to a comma-separated list of the hostnames of the NameNodes of the `DESTINATION` cluster.
`DESTINATION` trusts `SOURCE`	Distcp job runs on the `DESTINATION` cluster	Set the `mapreduce.job.hdfs-servers.token-renewal.exclude` property to a comma-separated list of the hostnames of the NameNodes of the `SOURCE` cluster.
`DESTINATION` trusts `SOURCE`	Distcp job runs on the `SOURCE` cluster	You do not need to set the `mapreduce.job.hdfs-servers.token-renewal.exclude` property.
Both `SOURCE` and `DESTINATION` trust each other	You do not need to set the `mapreduce.job.hdfs-servers.token-renewal.exclude` property.
Neither `SOURCE` nor `DESTINATION` trusts the other	If a common realm is usable (such as Active Directory), set the `mapreduce.job.hdfs-servers.token-renewal.exclude` property to a comma-separated list of hostnames of the NameNodes of the cluster that is not running the distcp job. For example, if you are running the job on the `DESTINATION` cluster: `kinit` on any `DESTINATION` YARN Gateway host using an AD account that can be used on both `SOURCE` and `DESTINATION`. Run the distcp job as the `hadoop` user: $ hadoop distcp -Ddfs.namenode.kerberos.principal.pattern=* \ -Dmapreduce.job.hdfs-servers.token-renewal.exclude=SOURCE-nn-host1,SOURCE-nn-host2 \ hdfs://source-nn-nameservice/source/path \ /destination/path By default, the YARN ResourceManager renews tokens for applications. The `mapreduce.job.hdfs-servers.token-renewal.exclude` property instructs ResourceManagers on either cluster to skip delegation token renewal for NameNode hosts.

Distcp between Secure Clusters in Distinct Kerberos Realms

This section explains how to copy data between two secure clusters in distinct Kerberos realms.

Specify the Destination Parameters in `krb5.conf`

Edit the krb5.conf file on the client (where the distcp job will be submitted) to include the destination hostname and realm.

[realms]
HADOOP.QA.domain.COM = { kdc = kdc.domain.com:88 admin_server = admin.test.com:749
default_domain = domain.com supported_enctypes = arcfour-hmac:normal des-cbc-crc:normal
des-cbc-md5:normal des:normal des:v4 des:norealm des:onlyrealm des:afs3 } 

[domain_realm]
.domain.com = HADOOP.test.domain.COM
domain.com = HADOOP.test.domain.COM
test03.domain.com = HADOOP.QA.domain.COM

Configure HDFS RPC Protection and Acceptable Kerberos Principal Patterns

Set the hadoop.rpc.protection property to authentication in both clusters. You can modify this property either in hdfs-site.xml, or using Cloudera Manager as follows:

Open the Cloudera Manager Admin Console.
Go to the HDFS service.
Click the Configuration tab.
Select Scope > HDFS-1 (Service-Wide).
Select Category > Security.
Locate the Hadoop RPC Protection property and select authentication.
Click Save Changes to commit the changes.

The following steps are not required if the two realms are already set up to trust each other, or have the same principal pattern. However, this isn't usually the case.

Set the dfs.namenode.kerberos.principal.pattern property to * to allow distcp irrespective of the principal patterns of the source and destination clusters. You can modify this property either in hdfs-site.xml on both clusters, or using Cloudera Manager as follows:

Open the Cloudera Manager Admin Console.
Go to the HDFS service.
Click the Configuration tab.
Select Scope > Gateway.
Select Category > Advanced.
Edit the HDFS Client Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml property to add:
```
<property>
  <name>dfs.namenode.kerberos.principal.pattern</name>
  <value>*</value>
</property>
```
Click Save Changes to commit the changes.

(If TLS/SSL is enabled) Specify Truststore Properties

The following properties must be configured in the ssl-client.xml file on the client submitting the distcp job to establish trust between the target and destination clusters.

<property>
<name>ssl.client.truststore.location</name>
<value>path_to_truststore</value>
</property>

<property>
<name>ssl.client.truststore.password</name>
<value>XXXXXX</value>
</property>

<property>
<name>ssl.client.truststore.type</name>
<value>jks</value>
</property>

Set `HADOOP_CONF` to the Destination Cluster

Set the HADOOP_CONF path to be the destination environment. If you are not using HFTP, set the HADOOP_CONF path to the source environment instead.

Launch Distcp

Kinit on the client and launch the distcp job.

hadoop distcp hdfs://test01.domain.com:8020/user/alice hdfs://test02.domain.com:8020/user/alice

If launching distcp fails, force Kerberos to use TCP instead of UDP by adding the following parameter to the krb5.conf file on the client.

[libdefaults]
udp_preference_limit = 1

Enabling Fallback Configuration

To enable the fallback configuration, for copying between secure and insecure clusters, add the following to the HDFS configuration file, core-default.xml, by using an advanced configuration snippet if you use Cloudera Manager, or editing the file directly otherwise.

<property>
  <name>ipc.client.fallback-to-simple-auth-allowed</name>
  <value>true</value>
</property>

Protocol Support for Distcp

The following table lists the protocols supported with the distcp command on different versions of CDH. "Secure" means that the cluster is configured to use Kerberos.

Source	Destination	Where to Issue distcp Command	Source Protocol	Source Config	Destination Protocol	Destination Config	Fallback Config Required
CDH 4	CDH 4	Destination	webhdfs	Secure	hdfs or webhdfs	Secure
CDH 4	CDH 4	Source or Destination	hdfs or webhdfs	Secure	hdfs or webhdfs	Secure
CDH 4	CDH 4	Source or Destination	hdfs or webhdfs	Insecure	hdfs or webhdfs	Insecure
CDH 4	CDH 4	Destination		Insecure	hdfs or webhdfs	Insecure

CDH 4	CDH 5	Destination	webhdfs	Secure	webhdfs or hdfs	Secure
CDH 4	CDH 5.1.3+	Destination	webhdfs	Insecure	webhdfs	Secure	Yes
CDH 4	CDH 5	Destination	webhdfs	Insecure	webhdfs or hdfs	Insecure
CDH 4	CDH 5	Source	hdfs or webhdfs	Insecure	webhdfs	Insecure

CDH 5	CDH 4	Source or Destination	webhdfs	Secure	webhdfs	Secure
CDH 5	CDH 4	Source	hdfs	Secure	webhdfs	Secure
CDH 5.1.3+	CDH 4	Source	hdfs or webhdfs	Secure	webhdfs	Insecure	Yes
CDH 5	CDH 4	Source or Destination	webhdfs	Insecure	webhdfs	Insecure
CDH 5	CDH 4	Destination	webhdfs	Insecure	hdfs	Insecure
CDH 5	CDH 4	Source	hdfs	Insecure	webhdfs	Insecure
CDH 5	CDH 4	Destination	webhdfs	Insecure	hdfs or webhdfs	Insecure

CDH 5	CDH 5	Source or Destination	hdfs or webhdfs	Secure	hdfs or webhdfs	Secure
CDH 5	CDH 5	Destination	webhdfs	Secure	hdfs or webhdfs	Secure
CDH 5.1.3+	CDH 5	Source	hdfs or webhdfs	Secure	hdfs or webhdfs	Insecure	Yes
CDH 5	CDH 5.1.3+	Destination	hdfs or webhdfs	Insecure	hdfs or webhdfs	Secure	Yes
CDH 5	CDH 5	Source or Destination	hdfs or webhdfs	Insecure	hdfs or webhdfs	Insecure
CDH 5	CDH 5	Destination	webhdfs	Insecure	hdfs or webhdfs	Insecure

Migrating Data between Clusters Using distcp

Copying Data between a Secure and an Insecure Cluster using DistCp and WebHDFS