2. Command Line Options

| Flag | Description | Notes |
| --- | --- | --- |
| -p[rbugpca] | Preserve r: replication number, b: block size, u: user, g: group, p: permission, c: checksum-type, a: ACL | Modification times are not preserved. Also, when -update is specified, status updates are not synchronized unless the file sizes also differ (i.e. unless the file is recreated). If -pa is specified, DistCp also preserves the permissions, because ACLs are a superset of permissions. |
| -i | Ignore failures | This option keeps more accurate statistics about the copy than the default case. It also preserves logs from failed copies, which can be valuable for debugging. Finally, a failing map does not cause the job to fail before all splits are attempted. |
| -log <logdir> | Write logs to <logdir> | DistCp keeps logs of each file it attempts to copy as map output. If a map fails and is re-executed, its log output is not retained. |
| -m <num_maps> | Maximum number of simultaneous copies | Specifies the number of maps used to copy data. Note that more maps do not necessarily improve throughput. |
| -overwrite | Overwrite destination | If a map fails and -i is not specified, all the files in the split, not only those that failed, are recopied. As discussed in the Usage documentation, this option also changes the semantics for generating destination paths, so use it carefully. |
| -update | Overwrite if src size different from dst size | As noted above, this is not a "sync" operation. The only criterion examined is the source and destination file sizes; if they differ, the source file replaces the destination file. As discussed in the Usage documentation, this option also changes the semantics for generating destination paths, so use it carefully. |
| -f <urilist_uri> | Use list at <urilist_uri> as src list | This is equivalent to listing each source on the command line. The urilist_uri list should be a fully qualified URI. |
| -filelimit <n> | Limit the total number of files to be <= n | Deprecated! Ignored in DistCp v2. |
| -sizelimit <n> | Limit the total size to be <= n bytes | Deprecated! Ignored in DistCp v2. |
| -delete | Delete the files existing in the dst but not in src | The deletion is done by FS Shell, so the trash is used if it is enabled. |
| -strategy {dynamic\|uniformsize} | Choose the copy-strategy to be used in DistCp. | By default, uniformsize is used (i.e. maps are balanced on the total size of files copied by each map, similar to legacy). If dynamic is specified, DynamicInputFormat is used instead. (This is described in the Architecture section, under InputFormats.) |
| -bandwidth | Specify bandwidth per map, in MB/second. | Each map is restricted to consume only the specified bandwidth. This is not always exact: the map throttles back its bandwidth consumption during a copy, such that the net bandwidth used tends towards the specified value. |
| -atomic {-tmp <tmp_dir>} | Specify atomic commit, with optional tmp directory. | -atomic instructs DistCp to copy the source data to a temporary target location, and then move the temporary target to the final location atomically. Data will either be available at the final target in a complete and consistent form, or not at all. Optionally, -tmp may be used to specify the location of the tmp-target. If not specified, a default is chosen. Note: tmp_dir must be on the final target cluster. |
| -mapredSslConf <ssl_conf_file> | Specify SSL Config file, to be used with HSFTP source | When using the hsftp protocol with a source, the security-related properties may be specified in a config file and passed to DistCp. <ssl_conf_file> needs to be in the classpath. |
| -async | Run DistCp asynchronously. Quits as soon as the Hadoop Job is launched. | The Hadoop Job-id is logged, for tracking. |
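
The options above are usually combined on a single command line. The following is a minimal sketch of a few such combinations, not a recommended configuration. The NameNode addresses (nn1, nn2), all paths, the numeric values, the `distcp` wrapper function, and the `RUN` variable are assumptions made for illustration. The wrapper only prints each command unless `RUN=1` is set, so the script can be read and tried without a cluster.

```shell
#!/bin/sh
# Hypothetical NameNode addresses and paths; substitute your own.
SRC="hdfs://nn1:8020/data/logs"
DST="hdfs://nn2:8020/backup/logs"

# Illustrative dry-run wrapper: prints the command unless RUN=1 is set.
distcp() {
  if [ "${RUN:-0}" = "1" ]; then
    hadoop distcp "$@"
  else
    echo "hadoop distcp $*"
  fi
}

# Replace destination files whose size differs from the source, preserving
# user, group, and permission, with at most 20 simultaneous copies.
distcp -update -pugp -m 20 "$SRC" "$DST"

# Same size comparison, and also delete destination files that are absent
# from the source (the trash is used if it is enabled).
distcp -update -delete "$SRC" "$DST"

# Continue past failed copies and write per-file logs to a log directory.
distcp -i -log "hdfs://nn2:8020/tmp/distcp-logs" "$SRC" "$DST"

# Throttle each map towards 10 MB/second and use the dynamic strategy.
distcp -bandwidth 10 -strategy dynamic "$SRC" "$DST"

# Stage the copy in a tmp directory on the target cluster, then commit.
distcp -atomic -tmp "hdfs://nn2:8020/tmp/distcp-staging" "$SRC" "$DST"

# Read the sources from a list file and return once the job is launched.
distcp -async -f "hdfs://nn1:8020/srclist" "$DST"
```

Because -update and -overwrite change how destination paths are generated, it may be worth reviewing the printed commands in dry-run mode before setting `RUN=1`.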