Copying Cluster Data Using DistCp
The distributed copy command, distcp, is a general utility for copying large data sets between distributed filesystems within and across clusters. You can also use distcp to copy data to and from an Amazon S3 bucket. The distcp command submits a regular MapReduce job that performs a file-by-file copy.
$ hadoop distcp
DistCp Syntax and Examples
You can use distcp to copy files between compatible clusters in either direction, from or to the source or destination clusters.
For example, when upgrading, say from a CDH 5.7 cluster to a CDH 5.9, you should run distcp from the CDH 5.13 cluster in this manner:
$ hadoop distcp hftp://cdh57-namenode:50070/ hdfs://CDH59-nameservice/ $ hadoop distcp s3a://bucket/ hdfs://CDH59-nameservice/
You can also use a specific path, such as /hbase to move HBase data, for example:
$ hadoop distcp hftp://cdh57-namenode:50070/hbase hdfs://CDH59-nameservice/hbase $ hadoop distcp s3a://bucket/file hdfs://CDH59-nameservice/bucket/file
HFTP Protocol
The HFTP protocol allows you to use FTP resources in an HTTP request. When copying with distcp across different versions of CDH, use hftp:// for the source filesystem and hdfs:// for the destination filesystem, and run distcp from the destination cluster. The default port for HFTP is 50070 and the default port for HDFS is 8020.
Example of a source URI: hftp://namenode-location:50070/basePath
- hftp:// is the source protocol.
- namenode-location is the CDH 4 (source) NameNode hostname as defined by its configured fs.default.name.
- 50070 is the NameNode's HTTP server port, as defined by the configured dfs.http.address.
Example of a destination URI: hdfs://nameservice-id/basePath or hdfs://namenode-location
- hdfs:// is the destination protocol
- nameservice-id or namenode-location is the CDH 5 (destination) NameNode hostname as defined by its configured fs.defaultFS.
- basePath in both examples refers to the directory you want to copy, if one is specifically needed.
Using DistCp with Amazon S3
You can copy HDFS files to and from an Amazon S3 instance. You must provision an S3 bucket using Amazon Web Services and obtain the access key and secret key. You can pass these credentials on the distcp command line, or you can reference a credential store to "hide" sensitive credentials so that they do not appear in the console output, configuration files, or log files.
Amazon S3 block and native filesystems are supported with the s3a:// protocol.
Example of an Amazon S3 Block Filesystem URI: s3a://bucket_name/path/to/file
<property> <name>fs.s3a.access.key</name> <value>...</value> </property> <property> <name>fs.s3a.secret.key</name> <value>...</value> </property>or run on the command line as follows:
hadoop distcp -Dfs.s3a.access.key=... -Dfs.s3a.secret.key=... s3a://
hadoop distcp -Dfs.s3a.access.key=myAccessKey -Dfs.s3a.secret.key=mySecretKey hdfs://user/hdfs/mydata s3a://myBucket/mydata_backup
Using a Credential Provider to Secure S3 Credentials
You can run the distcp command without having to enter the access key and secret key on the command line. This prevents these credentials from being exposed in console output, log files, configuration files, and other artifacts. Running the command in this way requires that you provision a credential store to securely store the access key and secret key.
- Provision the credentials by running the following commands:
hadoop credential create fs.s3a.access.key -value access_key -provider path_to_credential_store_file hadoop credential create fs.s3a.secret.key -value secret_key -provider path_to_credential_store_file
Enter the value for the -provider option using the following syntax:provider_type://file/path_to_credential_store_file
provider_type can be one of the following:- jceks – retrieves credentials from a JavaKeystoreProvider. Use when the credential provider is located in HDFS and requires access from multiple hosts.
- localjceks – retrieves credentials from a LocalJavaKeystoreProvider. Use when the credential provider is located on the local filesystem and access is only required from this single host. Cloudera recommends localjceks.
For example:hadoop credential create fs.s3a.access.key -value foobar -provider localjceks://file/home/keystores/aws.jceks hadoop credential create fs.s3a.secret.key -value barfoo -provider localjceks://file/home/keystores/aws.jceks
For more details on the hadoop credential command, see Credential Management (Apache Software Foundation).
- Copy the contents of the /etc/hadoop/conf directory to a working directory.
- Add the following to the core-site.xml file in the working directory:
<property> <name>hadoop.security.credential.provider.path</name> <value>provider_type://path_to_credential_store_file</value> </property>
- Set the HADOOP_CONF_DIR environment variable to the location of the working directory:
export HADOOP_CONF_DIR=path_to_working_directory
After completing these steps, you can run the distcp command using the following syntax:
hadoop distcp hdfs://source_path s3a://destination_path
hadoop distcp hdfs://source_path s3a://bucket_name/destination_path -D hadoop.security.credential.provider.path=path_to_credential_store_file
There are additional options for the distcp command. See DistCp Guide (Apache Software Foundation).
Examples of DistCP Commands Using the S3 Protocol and Hidden Credentials
- Copying files to Amazon S3
-
hadoop distcp hdfs://user/hdfs/mydata s3a://myBucket/mydata_backup
- Copying files from Amazon S3
-
hadoop distcp s3a://myBucket/mydata_backup hdfs://user/hdfs/mydata
- Copying files to Amazon S3 using the -filters option to exclude specified source files
- You specify a file name with the -filters option. The referenced file contains regular expressions, one per line, that define file name patterns to
exclude from the distcp job. The pattern specified in the regular expression should match the fully-qualified path of the intended files, including the scheme
(hdfs, webhdfs, s3a, etc.). For example, the following are valid expressions for excluding files:
hdfs://x.y.z:8020/a/b/c webhdfs://x.y.z:50070/a/b/c s3a://bucket/a/b/c
Reference the file containing the filter expressions using -filters option. For example:hadoop distcp -filters /user/joe/myFilters hdfs://user/hdfs/mydata s3a://myBucket/mydata_backup
Contents of the sample myFilters file:.*foo.* .*/bar/.* hdfs://x.y.z:8020/tmp/.* hdfs://x.y.z:8020/tmp1/file1
The regular expressions in the myFilters exclude the following files:- .*foo.* – excludes paths that contain the string "foo".
- .*/bar/.* – excludes paths that include a directory named bar.
- hdfs://x.y.z:8020/tmp/.* – excludes all files in the /tmp directory.
- hdfs://x.y.z:8020/tmp1/file1 – excludes the file /tmp1/file1.
- Copying files to Amazon S3 with the -overwrite option.
- The -overwrite option overwrites destination files that already exist.
hadoop distcp -overwrite hdfs://user/hdfs/mydata s3a://user/mydata_backup
For more information about the -filters, -overwrite, and other options, see DIstCp Guide: Command Line Options (Apache Software Foundation).
Kerberos Setup Guidelines for Distcp between Secure Clusters
- Let's assume you have two clusters with the realms: SOURCE and DESTINATION
- You have data that needs to be copied from SOURCE to DESTINATION
- Trust exists between SOURCE and Active Directory, and DESTINATION and Active Directory.
- Both SOURCE and DESTINATION clusters are running CDH 5.3.4 or higher
If your environment matches the one described above, use the following table to configure Kerberos delegation tokens on your cluster so that you can successfully distcp across two secure clusters. Based on the direction of the trust between the SOURCE and DESTINATION clusters, you can use the mapreduce.job.hdfs-servers.token-renewal.exclude property to instruct ResourceManagers on either cluster to skip or perform delegation token renewal for NameNode hosts.
Environment Type | Kerberos Delegation Token Setting | |
---|---|---|
SOURCE trusts DESTINATION | Distcp job runs on the DESTINATION cluster | You do not need to set the mapreduce.job.hdfs-servers.token-renewal.exclude property. |
Distcp job runs on the SOURCE cluster | Set the mapreduce.job.hdfs-servers.token-renewal.exclude property to a comma-separated list of the hostnames of the NameNodes of the DESTINATION cluster. | |
DESTINATION trusts SOURCE | Distcp job runs on the DESTINATION cluster | Set the mapreduce.job.hdfs-servers.token-renewal.exclude property to a comma-separated list of the hostnames of the NameNodes of the SOURCE cluster. |
Distcp job runs on the SOURCE cluster | You do not need to set the mapreduce.job.hdfs-servers.token-renewal.exclude property. | |
Both SOURCE and DESTINATION trust each other | You do not need to set the mapreduce.job.hdfs-servers.token-renewal.exclude property. | |
Neither SOURCE nor DESTINATION trusts the other | If a common realm is usable (such as Active Directory), set the mapreduce.job.hdfs-servers.token-renewal.exclude property to a comma-separated list of hostnames of the NameNodes of the cluster that is not running the
distcp job. For example, if you are running the job on the DESTINATION cluster:
|
Distcp between Secure Clusters in Distinct Kerberos Realms
This section explains how to copy data between two secure clusters in distinct Kerberos realms.
Specify the Destination Parameters in krb5.conf
[realms] HADOOP.QA.domain.COM = { kdc = kdc.domain.com:88 admin_server = admin.test.com:749 default_domain = domain.com supported_enctypes = arcfour-hmac:normal des-cbc-crc:normal des-cbc-md5:normal des:normal des:v4 des:norealm des:onlyrealm des:afs3 } [domain_realm] .domain.com = HADOOP.test.domain.COM domain.com = HADOOP.test.domain.COM test03.domain.com = HADOOP.QA.domain.COM
Configure HDFS RPC Protection and Acceptable Kerberos Principal Patterns
- Open the Cloudera Manager Admin Console.
- Go to the HDFS service.
- Click the Configuration tab.
- Select
- Select .
- Locate the Hadoop RPC Protection property and select authentication.
- Click Save Changes to commit the changes.
The following steps are not required if the two realms are already set up to trust each other, or have the same principal pattern. However, this isn't usually the case.
- Open the Cloudera Manager Admin Console.
- Go to the HDFS service.
- Click the Configuration tab.
- Select
- Select .
- Edit the HDFS Client Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml property to add:
<property> <name>dfs.namenode.kerberos.principal.pattern</name> <value>*</value> </property>
- Click Save Changes to commit the changes.
(If TLS/SSL is enabled) Specify Truststore Properties
<property> <name>ssl.client.truststore.location</name> <value>path_to_truststore</value> </property> <property> <name>ssl.client.truststore.password</name> <value>XXXXXX</value> </property> <property> <name>ssl.client.truststore.type</name> <value>jks</value> </property>
Set HADOOP_CONF to the Destination Cluster
Set the HADOOP_CONF path to be the destination environment. If you are not using HFTP, set the HADOOP_CONF path to the source environment instead.
Launch Distcp
hadoop distcp hdfs://test01.domain.com:8020/user/alice hdfs://test02.domain.com:8020/user/alice
[libdefaults] udp_preference_limit = 1
Enabling Fallback Configuration
<property> <name>ipc.client.fallback-to-simple-auth-allowed</name> <value>true</value> </property>
Protocol Support for Distcp
The following table lists the protocols supported with the distcp command on different versions of CDH. "Secure" means that the cluster is configured to use Kerberos.
Source | Destination | Where to Issue distcp Command | Source Protocol | Source Config | Destination Protocol | Destination Config | Fallback Config Required |
---|---|---|---|---|---|---|---|
CDH 4 | CDH 4 | Destination | hftp | Secure | hdfs or webhdfs | Secure | |
CDH 4 | CDH 4 | Source or Destination | hdfs or webhdfs | Secure | hdfs or webhdfs | Secure | |
CDH 4 | CDH 4 | Source or Destination | hdfs or webhdfs | Insecure | hdfs or webhdfs | Insecure | |
CDH 4 | CDH 4 | Destination | hftp | Insecure | hdfs or webhdfs | Insecure | |
CDH 4 | CDH 5 | Destination | webhdfs or hftp | Secure | webhdfs or hdfs | Secure | |
CDH 4 | CDH 5.1.3+ | Destination | webhdfs | Insecure | webhdfs | Secure | Yes |
CDH 4 | CDH 5 | Destination | webhdfs or hftp | Insecure | webhdfs or hdfs | Insecure | |
CDH 4 | CDH 5 | Source | hdfs or webhdfs | Insecure | webhdfs | Insecure | |
CDH 5 | CDH 4 | Source or Destination | webhdfs | Secure | webhdfs | Secure | |
CDH 5 | CDH 4 | Source | hdfs | Secure | webhdfs | Secure | |
CDH 5.1.3+ | CDH 4 | Source | hdfs or webhdfs | Secure | webhdfs | Insecure | Yes |
CDH 5 | CDH 4 | Source or Destination | webhdfs | Insecure | webhdfs | Insecure | |
CDH 5 | CDH 4 | Destination | webhdfs | Insecure | hdfs | Insecure | |
CDH 5 | CDH 4 | Source | hdfs | Insecure | webhdfs | Insecure | |
CDH 5 | CDH 4 | Destination | hftp | Insecure | hdfs or webhdfs | Insecure | |
CDH 5 | CDH 5 | Source or Destination | hdfs or webhdfs | Secure | hdfs or webhdfs | Secure | |
CDH 5 | CDH 5 | Destination | hftp | Secure | hdfs or webhdfs | Secure | |
CDH 5.1.3+ | CDH 5 | Source | hdfs or webhdfs | Secure | hdfs or webhdfs | Insecure | Yes |
CDH 5 | CDH 5.1.3+ | Destination | hdfs or webhdfs | Insecure | hdfs or webhdfs | Secure | Yes |
CDH 5 | CDH 5 | Source or Destination | hdfs or webhdfs | Insecure | hdfs or webhdfs | Insecure | |
CDH 5 | CDH 5 | Destination | hftp | Insecure | hdfs or webhdfs | Insecure |