Distcp syntax and examples
You can use distcp
for copying data between CDP clusters. In
addition, you can also use it to copy data between a CDP cluster and Amazon S3 or Azure Data
Lake Storage Gen 2.
Common use of distcp
The most common use of distcp
is an inter-cluster copy:
hadoop distcp hdfs://nn1:8020/source hdfs://nn2:8020/destination
Where hdfs://nn1:8020/source
is the data source, and
hdfs://nn2:8020/destination
is the destination. This will
expand the name space under /source on NameNode "nn1" into a temporary file,
partition its contents among a set of map tasks, and start copying from "nn1" to
"nn2". Note that DistCp requires absolute paths.
You can also specify multiple source directories:
hadoop distcp hdfs://nn1:8020/source/a hdfs://nn1:8020/source/b hdfs://nn2:8020/destination
Or specify multiple source directories from a file with the -f
option:
hadoop distcp -f hdfs://nn1:8020/srclist hdfs://nn2:8020/destination
Where srclist
contains:
hdfs://nn1:8020/source/a hdfs://nn1:8020/source/b
Copying between major versions
Run the distcp
command on the cluster that runs the higher version
of CDP, which should be the destination cluster. Use the following syntax:
hadoop distcp webhdfs://<namenode>:<port> hdfs://<namenode>
Note the webhdfs
prefix for the remote cluster, which should be your
source cluster. You must use webhdfs
when the clusters run
different major versions. When clusters run the same version, you can use the
hdfs
protocol for better performance.
For example, the following command copies data from a CDP source cluster named
example-source
to another CDP version destination cluster named
example-dest
:
hadoop distcp webhdfs://example-source.cloudera.com:8020 hdfs://example-dest.cloudera.com
Copying to/from Amazon S3
The following syntax for distcp
shows how to copy data to/from
S3:
#Copying from S3
hadoop distcp s3a://<bucket>/<data> hdfs://<namenode>/<directory>/
#Copying to S3
hadoop distcp hdfs://<namenode>/<directory> s3a://<bucket>/<data>
This is a basic example of using distcp
with S3.
Copying to/from ADLS Gen 2
The following syntax for distcp
shows how to copy data to/from ADLS
Gen 2:
#Copying from ABFS
hadoop distcp abfs[s]://<file_system>@<account_name>.dfs.core.windows.net/<path>/<file_name> hdfs://hdfs_destination_path
#Copying to ADLS Gen2
hadoop distcp hdfs://hdfs_destination_path abfs[s]://<file_system>@<account_name>.dfs.core.windows.net/<path>/<file_name>