S3A and Checksums (Advanced Feature)
The S3A connector can be configured to export the HTTP etag of an object as a checksum
by setting the option fs.s3a.etag.checksum.enabled to true. When
unset (the default), S3A objects have no checksum.
$ hadoop fs -touchz s3a://hwdev-bucket/src/something.txt
$ hadoop fs -checksum s3a://hwdev-bucket/src/something.txt
s3a://hwdev-bucket/src/something.txt NONE
Once set, S3A objects have a checksum which is created on upload.
$ hadoop fs -Dfs.s3a.etag.checksum.enabled=true -checksum s3a://hwdev-bucket/src/something.txt
s3a://hwdev-bucket/src/something.txt etag 6434316438636439386630306232303465393830303939386563663834323765
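The checksum value printed above appears to be the hex encoding of the etag string itself. The following is a minimal sketch of decoding it in portable shell; decode_hex is a hypothetical helper name, not part of Hadoop, and the sketch assumes the value is an even-length string of hex digits.

```shell
#!/bin/sh
# decode_hex (hypothetical helper): turn a hex string, as printed by
# `hadoop fs -checksum`, back into the text it encodes.
decode_hex() {
  rest=$1
  out=""
  while [ -n "$rest" ]; do
    case $rest in
      ?) echo "odd-length hex string" >&2; return 1 ;;
    esac
    pair=${rest%"${rest#??}"}   # first two hex digits
    rest=${rest#??}             # remainder after those two digits
    # hex -> decimal -> three-digit octal escape, which printf handles portably
    out="$out$(printf "\\$(printf '%03o' "$((0x$pair))")")"
  done
  printf '%s\n' "$out"
}

# Checksum value taken from the example output above.
etag=$(decode_hex 6434316438636439386630306232303465393830303939386563663834323765)
echo "$etag"

# For comparison: the MD5 digest of empty input, if md5sum is installed.
if command -v md5sum >/dev/null 2>&1; then
  printf '' | md5sum | cut -d' ' -f1
fi
```

The decoded value is d41d8cd98f00b204e9800998ecf8427e, the MD5 digest of empty input, which is consistent with something.txt having been created as a zero-byte file by -touchz.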
This checksum is not compatible with that of HDFS, so it cannot be used to compare file
versions when using the -update option on DistCp between S3 and HDFS. More
specifically, unless -skipcrccheck is set, the DistCp operation will fail with
a checksum mismatch. However, it can be used for incremental updates within and across S3A
buckets.
$ hadoop distcp -Dfs.s3a.etag.checksum.enabled=true -update s3a://hwdev-bucket/src s3a://hwdev-bucket/dest
$ hadoop fs -Dfs.s3a.etag.checksum.enabled=true -checksum s3a://hwdev-bucket/dest/something.txt
s3a://hwdev-bucket/dest/something.txt etag 6434316438636439386630306232303465393830303939386563663834323765
Because the checksums match for small files uploaded as a single block, incremental updates will not copy unchanged files. For large files uploaded in multiple blocks, the checksum values may differ, in which case the source file will be copied again.
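To check by hand whether two objects report the same checksum, the output lines of hadoop fs -checksum can be compared. The following is a rough sketch only: same_checksum is a hypothetical helper, it assumes paths without whitespace, and it does not reproduce DistCp's own decision logic.

```shell
#!/bin/sh
# same_checksum (hypothetical helper): succeed only if two output lines of
# `hadoop fs -checksum` report the same algorithm and the same value.
# Assumes each line has the form: <path> <algorithm> <hex value>
same_checksum() {
  a=$(printf '%s\n' "$1" | awk '{ print $2 ":" $3 }')
  b=$(printf '%s\n' "$2" | awk '{ print $2 ":" $3 }')
  case $a in
    NONE:*|:*) return 1 ;;   # no checksum reported, nothing to compare
  esac
  [ "$a" = "$b" ]
}

# Sample lines modelled on the example output in this section.
src='s3a://hwdev-bucket/src/something.txt etag 6434316438636439386630306232303465393830303939386563663834323765'
dest='s3a://hwdev-bucket/dest/something.txt etag 6434316438636439386630306232303465393830303939386563663834323765'

if same_checksum "$src" "$dest"; then
  echo "checksums match"
else
  echo "checksums differ"
fi
```

In practice the two arguments would be the captured output of hadoop fs -Dfs.s3a.etag.checksum.enabled=true -checksum for the source and destination paths. Lines reporting NONE are treated as not comparable, since no checksum was exported.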