Examples of DistCp commands using the S3 protocol and hidden credentials

You can various distcp command options to copy files between your CDP clusters and Amazon S3.

Copying files to Amazon S3
hadoop distcp /user/hdfs/mydata s3a://myBucket/mydata_backup
Copying files from Amazon S3
hadoop distcp s3a://myBucket/mydata_backup //user/hdfs/mydata
Copying files to Amazon S3 using the -filters option to exclude specified source files
You specify a file name with the -filters option. The referenced file contains regular expressions, one per line, that define file name patterns to exclude from the distcp job. The pattern specified in the regular expression should match the fully-qualified path of the intended files, including the scheme (hdfs, webhdfs, s3a, etc.). For example, the following are valid expressions for excluding files:
hdfs://x.y.z:8020/a/b/c
webhdfs://x.y.z:50070/a/b/c
s3a://bucket/a/b/c
Reference the file containing the filter expressions using -filters option. For example:
hadoop distcp -filters /user/joe/myFilters /user/hdfs/mydata s3a://myBucket/mydata_backup
Contents of the sample myFilters file:
.*foo.*
.*/bar/.*
hdfs://x.y.z:8020/tmp/.*
hdfs://x.y.z:8020/tmp1/file1
The regular expressions in the myFilters exclude the following files:
  • .*foo.* – excludes paths that contain the string "foo".
  • .*/bar/.* – excludes paths that include a directory named bar.
  • hdfs://x.y.z:8020/tmp/.* – excludes all files in the /tmp directory.
  • hdfs://x.y.z:8020/tmp1/file1 – excludes the file /tmp1/file1.
Copying files to Amazon S3 with the -overwrite option.
The -overwrite option overwrites destination files that already exist.
hadoop distcp -overwrite /user/hdfs/mydata  s3a://user/mydata_backup

For more information about the -filters, -overwrite, and other options, see DistCp Guide: Command Line Options (Apache Software Foundation).