Accessing Cloud Data

S3Guard: Operational Issues

The following operational issues have been identified while testing S3Guard.

S3Guard and Hive

S3Guard provides the consistent directory listings that Hive needs when listing input data and committing output. Without S3Guard, it is not safe to use an S3 bucket as a direct destination of Hive queries.

S3Guard and MapReduce and Spark

S3Guard provides the consistent listings needed for listing data and committing output. Apache Spark and Hadoop MapReduce now support a high-performance “committer” that safely writes their output to an S3 bucket, even if S3Guard is disabled. However, without S3Guard it is not safe for one query to read output that another query has only “recently” written to the S3A bucket. For this reason, we recommend using S3Guard even when one of the new “S3A Committers” is used.

Third-party S3-compatible object stores

Third-party object stores which reimplement the AWS S3 protocol are usually "consistent". When a store is consistent, there is no need to use S3Guard with it. Consult the object store's supplier as to its consistency model.

S3Guard Security Aspects

The DynamoDB table needs to be writeable by all users and services using S3Guard. If a single DynamoDB table is used to store metadata about multiple buckets, then clients with access to the table will be able to read the metadata about objects in any of those buckets, including buckets to which their read access is restricted via AWS permissions.

The standard S3 bucket and object access permissions do not restrict access to the S3Guard index data. The index holds only the Hadoop file status data: object name, type, size and timestamp. The actual object data and any tags attached to the object are still protected by AWS permissions. However, directory names and filenames will be visible to anyone who can read the table.
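The exposure described above can be illustrated with a small toy model. This is only a sketch: the table layout, the bucket names, the user "alice" and both helper functions are hypothetical, and nothing here calls AWS or Hadoop.

```python
# Toy model only: hypothetical structures, not real AWS or Hadoop calls.

# One shared table holding file status entries for two buckets.
shared_table = [
    {"bucket": "sales", "path": "/2018/q1.orc", "is_dir": False, "size": 1024},
    {"bucket": "hr", "path": "/salaries/2018.orc", "is_dir": False, "size": 2048},
]

# Bucket-level read permissions granted to each user (assumed for this example).
s3_read_access = {"alice": {"sales"}}


def visible_paths(user_has_table_access):
    """Return the (bucket, path) pairs a user learns by scanning the table."""
    if not user_has_table_access:
        return []
    # Table access is not filtered by the user's S3 bucket permissions.
    return [(entry["bucket"], entry["path"]) for entry in shared_table]


def can_read_data(user, bucket):
    """Object data stays behind the S3 bucket permissions."""
    return bucket in s3_read_access.get(user, set())


paths = visible_paths(user_has_table_access=True)
print(paths)                         # names from both buckets are listed
print(can_read_data("alice", "hr"))  # the object data itself is still protected
```

In this model a user who can read the table sees the path stored for the "hr" bucket, even though the same user cannot read any object data in that bucket.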

Limitations of S3Guard

The key limitation of S3Guard is that it only provides consistent file and directory listings. It does not address update and delete consistency of the data.

S3Guard is consistent only with respect to changes made by client applications that use the S3A connector with S3Guard enabled and share the same DynamoDB table. Changes made by other applications are only eventually consistent from the perspective of S3A clients.
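A minimal sketch of this limitation, using a toy in-memory model rather than the real S3A connector. The classes `ToyS3` and `ToyS3GuardClient` and the `settle()` step are hypothetical, and the sketch assumes that a listing is the union of the S3 LIST result and the entries in the shared table.

```python
# Toy model only: these classes are hypothetical and are not the S3A API.

class ToyS3:
    """Object store whose LIST results lag behind writes."""

    def __init__(self):
        self.objects = set()  # keys that exist in the store
        self.listed = set()   # keys that LIST currently returns

    def put(self, key):
        # The object exists immediately, but LIST does not show it yet.
        self.objects.add(key)

    def settle(self):
        # Simulate the listing eventually catching up with the writes.
        self.listed = set(self.objects)

    def list(self):
        return set(self.listed)


class ToyS3GuardClient:
    """Client that records every write in a shared metadata table."""

    def __init__(self, s3, table):
        self.s3 = s3
        self.table = table

    def create(self, key):
        self.s3.put(key)
        self.table.add(key)

    def list(self):
        # Merge the (possibly stale) LIST result with the table entries.
        return self.s3.list() | self.table


s3 = ToyS3()
table = set()
writer = ToyS3GuardClient(s3, table)
reader = ToyS3GuardClient(s3, table)

writer.create("out/part-0000")  # written through an S3Guard client
s3.put("out/part-0001")         # written by another application

before = reader.list()  # taken while the S3 listing is still stale
s3.settle()
after = reader.list()   # taken once the S3 listing has caught up

print(sorted(before))
print(sorted(after))
```

In this model the file written through the S3Guard client is listed immediately, while the file written by the other application appears only after the store's own listing has caught up.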

Unsupported Feature: Authoritative Mode

S3Guard has an experimental option, fs.s3a.metadatastore.authoritative, which declares that the S3Guard database should be treated as the reference for all file status and directory listings, and that the S3 repository itself need not be queried. That is, the DynamoDB table moves from being a cache of file information to the "source of truth".

This makes listing operations significantly faster, because there is never any need to make slow "LIST" calls to the S3 store. However, it is dangerous: if S3Guard ever becomes inconsistent with the S3 store, then files may be missing from listings of the S3 bucket, resulting in incomplete or wrong data in Hive, MapReduce and Spark queries. It also requires every single application working with the S3 bucket to use S3Guard.
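The risk can be sketched with a toy listing function. This is an illustrative model, not the S3A implementation: `list_directory`, its arguments and the file names are hypothetical, and the non-authoritative case is assumed to merge the S3 LIST result with the table entries.

```python
# Toy model only: hypothetical names, not the S3A implementation.

def list_directory(s3_listing, table, authoritative):
    """Return the keys a client would see for one directory."""
    if authoritative:
        # The table is the source of truth; the S3 store is never queried.
        return set(table)
    # Otherwise the table supplements the result of the S3 LIST call.
    return set(s3_listing) | set(table)


table = {"data/a.csv"}                     # recorded by an S3Guard client
s3_listing = {"data/a.csv", "data/b.csv"}  # b.csv was uploaded without S3Guard

normal = list_directory(s3_listing, table, authoritative=False)
auth = list_directory(s3_listing, table, authoritative=True)
missing = normal - auth

print(sorted(missing))  # files that a query would silently skip
```

Here the file uploaded without S3Guard never reaches the table, so the authoritative listing omits it even though the object exists in the store.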

We do not currently recommend using authoritative mode.