The DistCp -update option copies files from the source that do not exist at the target, or whose contents differ from the target copy. The DistCp -overwrite option overwrites files that already exist at the target, even if they have the same contents as the source.
The -update and -overwrite options warrant further discussion, because they handle source paths in a subtly different way from the defaults.
Consider a copy from /source/first/ and /source/second/ to /target/, where the source paths have the following contents:
    hdfs://nn1:8020/source/first/1
    hdfs://nn1:8020/source/first/2
    hdfs://nn1:8020/source/second/10
    hdfs://nn1:8020/source/second/20
When DistCp is invoked without -update or -overwrite, the DistCp defaults create the directories first/ and second/ under /target. Thus:
    distcp hdfs://nn1:8020/source/first hdfs://nn1:8020/source/second hdfs://nn2:8020/target
would yield the following contents in /target:
    hdfs://nn2:8020/target/first/1
    hdfs://nn2:8020/target/first/2
    hdfs://nn2:8020/target/second/10
    hdfs://nn2:8020/target/second/20
When either -update or -overwrite is specified, the contents of the source directories are copied to the target, and not the source directories themselves. Thus:
    distcp -update hdfs://nn1:8020/source/first hdfs://nn1:8020/source/second hdfs://nn2:8020/target
would yield the following contents in /target:
    hdfs://nn2:8020/target/1
    hdfs://nn2:8020/target/2
    hdfs://nn2:8020/target/10
    hdfs://nn2:8020/target/20
By extension, if both source folders contained a file with the same name ("0", for example), then both sources would map an entry to /target/0 at the destination. Rather than permit this conflict, DistCp will abort.
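The path mapping described above can be modelled in a few lines of Python. This is only an illustrative sketch, not DistCp's implementation: the function name plan_targets and its flatten parameter are hypothetical, and the hdfs://host:port prefixes are dropped to keep the paths short.

```python
from collections import defaultdict
from posixpath import basename, join


def plan_targets(source_dirs, source_files, target, flatten=False):
    """Map each source file to the target path it would be copied to.

    flatten=False models the default behaviour: each source directory
    is recreated under the target.
    flatten=True models -update / -overwrite: only the contents of each
    source directory are placed under the target.
    Raises ValueError if two source files map to the same target path.
    """
    mapping = defaultdict(list)
    for src_dir in source_dirs:
        src_dir = src_dir.rstrip("/")
        prefix = src_dir + "/"
        for path in source_files:
            if not path.startswith(prefix):
                continue
            rel = path[len(prefix):]  # path relative to the source directory
            if flatten:
                dest = join(target, rel)
            else:
                dest = join(target, basename(src_dir), rel)
            mapping[dest].append(path)

    conflicts = sorted(d for d, srcs in mapping.items() if len(srcs) > 1)
    if conflicts:
        # Mirrors the documented outcome: the copy is aborted on a conflict.
        raise ValueError(f"duplicate target paths: {conflicts}")
    return {dest: srcs[0] for dest, srcs in mapping.items()}


dirs = ["/source/first", "/source/second"]
files = [
    "/source/first/1",
    "/source/first/2",
    "/source/second/10",
    "/source/second/20",
]

default_plan = plan_targets(dirs, files, "/target")
update_plan = plan_targets(dirs, files, "/target", flatten=True)
```

Here default_plan contains keys such as /target/first/1, while update_plan contains /target/1. Adding /source/first/0 and /source/second/0 to files and calling plan_targets with flatten=True raises ValueError, which corresponds to the conflict case.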
Now, consider the following copy operation:
    distcp -update hdfs://nn1:8020/source/first hdfs://nn1:8020/source/second hdfs://nn2:8020/target
With the following source files and sizes:
    hdfs://nn1:8020/source/first/1      32
    hdfs://nn1:8020/source/first/2      32
    hdfs://nn1:8020/source/second/10    64
    hdfs://nn1:8020/source/second/20    32
And the following destination files and sizes:
    hdfs://nn2:8020/target/1     32
    hdfs://nn2:8020/target/10    32
    hdfs://nn2:8020/target/20    64
The copy results in the following destination files and sizes:
    hdfs://nn2:8020/target/1     32
    hdfs://nn2:8020/target/2     32
    hdfs://nn2:8020/target/10    64
    hdfs://nn2:8020/target/20    32
With -update, 1 is skipped because the file length and contents match. 2 is copied because it doesn't exist at the target. 10 and 20 are overwritten because their contents don't match the source.
If the -overwrite option is used instead, 1 is overwritten as well.
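The per-file outcome in this example can be sketched as a small decision function. This is a simplified model under stated assumptions, not DistCp's code: the name decide and the mode strings are hypothetical, and file size alone stands in for the length-and-contents comparison described above.

```python
def decide(name, source_sizes, target_sizes, mode="update"):
    """Return 'copy', 'skip', or 'overwrite' for one file name.

    mode="update" models -update; mode="overwrite" models -overwrite.
    Assumption: equal size is treated as "same contents" for this sketch.
    """
    if name not in target_sizes:
        return "copy"  # nothing at the target yet
    if mode == "overwrite":
        return "overwrite"  # replace existing files unconditionally
    if source_sizes[name] == target_sizes[name]:
        return "skip"  # target already matches the source
    return "overwrite"  # target differs from the source


# File names and sizes taken from the example above.
source_sizes = {"1": 32, "2": 32, "10": 64, "20": 32}
target_sizes = {"1": 32, "10": 32, "20": 64}

update_result = {
    name: decide(name, source_sizes, target_sizes, "update")
    for name in source_sizes
}
overwrite_result = {
    name: decide(name, source_sizes, target_sizes, "overwrite")
    for name in source_sizes
}
```

In update_result, 1 is skipped, 2 is copied, and 10 and 20 are overwritten. In overwrite_result, the only difference is that 1 is overwritten too.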