Distcp syntax and examples

You can use distcp to copy data between CDP clusters, and between a CDP cluster and Amazon S3 or Azure Data Lake Storage Gen 2.

Common use of distcp

The most common use of distcp is an inter-cluster copy:

hadoop distcp hdfs://nn1:8020/source hdfs://nn2:8020/destination

Here, hdfs://nn1:8020/source is the data source and hdfs://nn2:8020/destination is the destination. The command expands the namespace under /source on NameNode "nn1" into a temporary file, partitions its contents among a set of map tasks, and starts copying from "nn1" to "nn2". Note that DistCp requires absolute paths.
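
Because DistCp requires absolute paths, it can help to check the arguments before submitting a job. The following is a minimal sketch, not part of DistCp itself: the helper names is_absolute and check_distcp_paths are hypothetical, and the script only prints the command instead of running it.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the argument is an absolute path
# (starts with "/") or a fully qualified URI (contains "://").
is_absolute() {
  case "$1" in
    /*|*://*) return 0 ;;
    *)        return 1 ;;
  esac
}

# Hypothetical helper: checks every argument and reports the first
# relative path on stderr.
check_distcp_paths() {
  for p in "$@"; do
    if ! is_absolute "$p"; then
      echo "not an absolute path: $p" >&2
      return 1
    fi
  done
  return 0
}

# Dry run: print the command only if all paths pass the check.
if check_distcp_paths hdfs://nn1:8020/source hdfs://nn2:8020/destination; then
  echo "hadoop distcp hdfs://nn1:8020/source hdfs://nn2:8020/destination"
fi
```

A relative argument such as source/a fails the check, so the command is never printed for it.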

You can also specify multiple source directories:

hadoop distcp hdfs://nn1:8020/source/a hdfs://nn1:8020/source/b hdfs://nn2:8020/destination

Or specify multiple source directories from a file with the -f option:

hadoop distcp -f hdfs://nn1:8020/srclist hdfs://nn2:8020/destination

Where srclist contains:

hdfs://nn1:8020/source/a
hdfs://nn1:8020/source/b
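
As an illustration, the source list can be written locally and then uploaded before running distcp with -f. This is a sketch under stated assumptions: the local file location and the RUN_DISTCP switch are hypothetical, and the Hadoop commands run only when RUN_DISTCP=1 is set on a host with access to both clusters.

```shell
#!/bin/sh
# Write the source list to a local file, one fully qualified URI per line.
# The local location is an assumption made for this example.
srclist_local="${TMPDIR:-/tmp}/srclist"
printf '%s\n' \
  'hdfs://nn1:8020/source/a' \
  'hdfs://nn1:8020/source/b' > "$srclist_local"

# RUN_DISTCP is a hypothetical switch for this sketch; by default the
# script only writes the file and reports what it would do.
if [ "${RUN_DISTCP:-0}" = "1" ]; then
  # Upload the list to the source cluster, overwriting any older copy.
  hadoop fs -put -f "$srclist_local" hdfs://nn1:8020/srclist
  # Copy every directory named in the list to the destination.
  hadoop distcp -f hdfs://nn1:8020/srclist hdfs://nn2:8020/destination
else
  echo "dry run: wrote $srclist_local"
fi
```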

Copying between major versions

Run the distcp command on the cluster that runs the higher version of CDP, which should be the destination cluster. Use the following syntax:

hadoop distcp webhdfs://<namenode>:<port> hdfs://<namenode>

Note the webhdfs prefix for the remote cluster, which should be your source cluster. You must use webhdfs when the clusters run different major versions. When clusters run the same version, you can use the hdfs protocol for better performance.

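For example, the following command copies data from a source cluster named example-source to a destination cluster named example-dest that runs a different major version. The port in the webhdfs URI must be the NameNode HTTP port of the source cluster, which serves WebHDFS (9870 by default in Hadoop 3), not the RPC port 8020:

hadoop distcp webhdfs://example-source.cloudera.com:9870 hdfs://example-dest.cloudera.com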
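
The rule for choosing the source protocol can be sketched as a small helper. The function name source_scheme and the version numbers below are assumptions made for illustration; the script only prints the resulting command template.

```shell
#!/bin/sh
# Hypothetical helper: print the URI scheme to use for the source cluster.
# Arguments: source major version, destination major version.
source_scheme() {
  if [ "$1" -eq "$2" ]; then
    echo hdfs      # same major version: hdfs performs better
  else
    echo webhdfs   # different major versions: webhdfs is required
  fi
}

# Assumed example values, not taken from a real cluster.
src_major=6
dst_major=7

scheme=$(source_scheme "$src_major" "$dst_major")
echo "hadoop distcp ${scheme}://<namenode>:<port> hdfs://<namenode>"
```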

Copying to/from Amazon S3

The following syntax for distcp shows how to copy data to/from S3:

#Copying from S3
hadoop distcp s3a://<bucket>/<data> hdfs://<namenode>/<directory>/
#Copying to S3
hadoop distcp hdfs://<namenode>/<directory> s3a://<bucket>/<data>

This is the basic form of the command. It assumes that the cluster is already configured with credentials that allow access to the bucket.

Copying to/from ADLS Gen 2

The following syntax for distcp shows how to copy data to/from ADLS Gen 2:

#Copying from ADLS Gen 2
hadoop distcp abfs[s]://<file_system>@<account_name>.dfs.core.windows.net/<path>/<file_name> hdfs://<hdfs_destination_path>
#Copying to ADLS Gen 2
hadoop distcp hdfs://<hdfs_source_path> abfs[s]://<file_system>@<account_name>.dfs.core.windows.net/<path>/<file_name>
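
The ADLS Gen 2 URI has several parts, so a small helper can make the syntax easier to read. In this sketch the function name abfs_uri and the sample file system, account, and path values are hypothetical. The optional [s] in the syntax above corresponds to the abfss scheme, which uses TLS.

```shell
#!/bin/sh
# Hypothetical helper: abfs_uri <file_system> <account_name> <path> [tls]
# Prints an ADLS Gen 2 URI; pass "tls" as the fourth argument for abfss.
# The body runs in a subshell so that "scheme" does not leak to the caller.
abfs_uri() (
  scheme=abfs
  if [ "${4:-}" = "tls" ]; then
    scheme=abfss
  fi
  # Strip one leading slash from the path so the URI has a single "/".
  printf '%s://%s@%s.dfs.core.windows.net/%s\n' "$scheme" "$1" "$2" "${3#/}"
)

# Dry run with assumed sample values: print a copy-to-ADLS command.
echo "hadoop distcp hdfs://<hdfs_source_path> $(abfs_uri data myaccount /logs/2024 tls)"
```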