Update and overwrite
Use the -update option to copy files from a source when they do not exist at the target or have different contents. Use the -overwrite option to overwrite the target files even if the content is the same.
The DistCp -update option copies files from the source that do not exist at the target, or that have different contents. The DistCp -overwrite option overwrites files that exist at the target, even if they have the same contents as the source.
The -update and -overwrite options warrant further discussion, since their handling of source paths varies from the defaults in a very subtle manner.
Consider a copy from /source/first/ and /source/second/ to /target/, where the source paths have the following contents:
hdfs://nn1:8020/source/first/1
hdfs://nn1:8020/source/first/2
hdfs://nn1:8020/source/second/10
hdfs://nn1:8020/source/second/20
When DistCp is invoked without -update or -overwrite, the DistCp defaults create the directories first/ and second/ under /target. Thus:
distcp hdfs://nn1:8020/source/first hdfs://nn1:8020/source/second hdfs://nn2:8020/target
would yield the following contents in /target:
hdfs://nn2:8020/target/first/1
hdfs://nn2:8020/target/first/2
hdfs://nn2:8020/target/second/10
hdfs://nn2:8020/target/second/20
When either -update or -overwrite is specified, the contents of the source directories are copied to the target, and not the source directories themselves. Thus:
distcp -update hdfs://nn1:8020/source/first hdfs://nn1:8020/source/second hdfs://nn2:8020/target
would yield the following contents in /target:
hdfs://nn2:8020/target/1
hdfs://nn2:8020/target/2
hdfs://nn2:8020/target/10
hdfs://nn2:8020/target/20
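The difference between the two mappings can be sketched in a few lines of Python. This is only an illustrative model of the path mapping described above, not DistCp's own code: the function name map_targets and its flatten parameter are hypothetical, and the NameNode prefixes are dropped for brevity.

```python
from posixpath import basename, join


def map_targets(sources, target, flatten=False):
    """Return (source_path, target_path) pairs for a copy.

    sources maps each source directory to the file names it contains.
    flatten=False models the default behaviour; flatten=True models
    the behaviour described for -update and -overwrite.
    (Hypothetical helper, for illustration only.)
    """
    pairs = []
    for src_dir, names in sources.items():
        if flatten:
            # Only the directory's contents are copied into the target.
            dest_dir = target
        else:
            # The source directory itself is recreated under the target.
            dest_dir = join(target, basename(src_dir.rstrip("/")))
        for name in names:
            pairs.append((join(src_dir, name), join(dest_dir, name)))
    return pairs


sources = {
    "/source/first": ["1", "2"],
    "/source/second": ["10", "20"],
}

default_targets = [t for _, t in map_targets(sources, "/target")]
update_targets = [t for _, t in map_targets(sources, "/target", flatten=True)]

print(default_targets)
print(update_targets)
```

The first list places the files under /target/first and /target/second; the second places all four files directly under /target.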
By extension, if both source directories contained a file with the same name ("0", for example), then both sources would map an entry to /target/0 at the destination. Rather than permit this conflict, DistCp will abort.
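A pre-flight check for this kind of collision could look roughly like the following. It is a sketch under the assumption that every source file lands directly under the target, as described for -update and -overwrite; find_conflicts is a hypothetical helper, not part of DistCp, and it only inspects file names supplied by the caller.

```python
from collections import defaultdict
from posixpath import join


def find_conflicts(sources, target):
    """Return {target_path: [source_paths]} for targets claimed more than once.

    sources maps each source directory to the file names it contains.
    (Hypothetical helper, for illustration only.)
    """
    claimed = defaultdict(list)
    for src_dir, names in sources.items():
        for name in names:
            # With -update or -overwrite, every file maps straight into target.
            claimed[join(target, name)].append(join(src_dir, name))
    return {t: srcs for t, srcs in claimed.items() if len(srcs) > 1}


sources = {
    "/source/first": ["0", "1", "2"],
    "/source/second": ["0", "10", "20"],
}

conflicts = find_conflicts(sources, "/target")
print(conflicts)
```

A non-empty result means at least two sources would write to the same target path, which is the situation in which DistCp aborts.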
Now, consider the following copy operation:
distcp -update hdfs://nn1:8020/source/first hdfs://nn1:8020/source/second hdfs://nn2:8020/target
With sources/sizes:
hdfs://nn1:8020/source/first/1 32
hdfs://nn1:8020/source/first/2 32
hdfs://nn1:8020/source/second/10 64
hdfs://nn1:8020/source/second/20 32
And destination/sizes:
hdfs://nn2:8020/target/1 32
hdfs://nn2:8020/target/10 32
hdfs://nn2:8020/target/20 64
Will result in:
hdfs://nn2:8020/target/1 32
hdfs://nn2:8020/target/2 32
hdfs://nn2:8020/target/10 64
hdfs://nn2:8020/target/20 32
If -update is used, 1 is skipped because the file length and contents match. 2 is copied because it does not exist at the target. 10 and 20 are overwritten because their contents do not match the source.
If the -overwrite option is used, 1 is overwritten as well.
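The per-file decision in this example can be modelled as follows. This is a simplified sketch, not DistCp's implementation: it compares file sizes only, whereas the text above says DistCp also considers file contents, and the name plan_copy and its overwrite parameter are hypothetical.

```python
def plan_copy(source_sizes, target_sizes, overwrite=False):
    """Decide what happens to each source file.

    source_sizes and target_sizes map file names to sizes.
    overwrite=False models -update; overwrite=True models -overwrite.
    Assumption: equal size stands in for "length and contents match".
    (Hypothetical helper, for illustration only.)
    """
    plan = {}
    for name, size in source_sizes.items():
        if name not in target_sizes:
            plan[name] = "copy"        # missing at the target
        elif overwrite or target_sizes[name] != size:
            plan[name] = "overwrite"   # forced, or the files differ
        else:
            plan[name] = "skip"        # already identical at the target
    return plan


# Sizes taken from the example above.
source_sizes = {"1": 32, "2": 32, "10": 64, "20": 32}
target_sizes = {"1": 32, "10": 32, "20": 64}

update_plan = plan_copy(source_sizes, target_sizes)
overwrite_plan = plan_copy(source_sizes, target_sizes, overwrite=True)

print(update_plan)
print(overwrite_plan)
```

The two plans differ only for file 1, which is skipped in the -update model and replaced in the -overwrite model.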