Introduction to S3Guard

Amazon S3 is an object store, not a filesystem. There are no directories, only objects. The S3A connector lets Hadoop, Hive and Spark applications see files in a directory tree, but really they are working on the objects underneath, by listing them and working on each one one-by-one.

Some of the operations which filesystems support are actually absent, with rename being the key one. The S3A connector mimics file or directory rename, by copying each file then deleting the original, which takes about 6-10 megabytes/second.

The S3 Object Store is "eventually consistent": when a file is deleted or overwritten it can take time for that change to propagate across all servers replicating the data. As a result, newly deleted files can still be visible, while queries of updated files can return the old version.

There is no specific time period by when the object store will be eventually consistent. The paper "Benchmarking Eventual Consistency" has shown it can vary by time of day, and be ten seconds or more – sometimes much more.

A critical problem is listing inconsistency: when a query is made of S3 to list all objects under a specific path, that listing can be out of date. This means that those operation on files under a "directory" mimicked by listing and acting on all objects underneath it are at risk of not seeing the complete list of files. Newly created files are at most risk.

This may affect the following operations on S3 data:

  • When listing files, newly created objects may not be listed immediately and deleted objects may continue to be listed — which means that your input for data processing may be incorrect. In Hive, Spark, or MapReduce, this could lead to erroneous results. In the worst case, it could potentially lead to data loss at the time of data movement.

  • When renaming directories, the listing may be incomplete or out of date, so the rename operation loses files. This is very dangerous as MapReduce, Hive, Spark and Tez all rely on rename to commit the output of workers to the final output of the job. If data is lost, the output is incorrect —something which may not be immediately obvious.

  • When deleting directories, the listing may be inconsistent, so not all objects are deleted. If another job writes data to the same directory path, the old data may still be present.

  • During an ETL workflow, in a sequence of multiple jobs that form the workflow, the next job is launched soon after the previous job has been completed. Applications such as Oozie rely on marker files to trigger the subsequent workflows. Any delay in the visibility of these files can lead to delays in the subsequent workflows.

  • During existence-guarded path operations, if a deleted file which has the same name as a target path appears in a listing, some actions may unexpectedly fail due to the target path being present — even though the file has already been deleted.

This eventually consistent behavior of S3 can cause seemingly unpredictable results from queries made against it, limiting the practical utility of the S3A connector for use cases where data gets modified.

S3Guard mitigates the issues related to S3's eventual consistency on listings by using a table on Amazon DynamoDB as a consistent metadata store. This guarantees a consistent view of data stored in S3. In addition, S3Guard may improve query performance by reducing the number of times S3 needs to be contacted, —as DynamoDB is significantly faster.

How S3Guard Works

S3Guard is a feature in the Hadoop S3A connector which uses Amazon's DynamoDB to cache information about created and deleted files, "The S3Guard Database".

When an application using the S3A Connector with S3Guard enabled manipulates objects in S3, such as creating, deleting or "renaming" them, the S3Guard database is updated. Newly created/copied files are added to the table, while deleted files have "tombstone markers" added to indicate that they have been deleted.

When a directory is listed on S3, the information from the S3Guard database is used to update the listing with this information: new files are added while those files with tombstone markers are not included in a listing.

As a result, the listing is up to date with all operations performed on the directory by all clients using S3Guard.

When a file is opened for reading, or any existence check made, the S3Guard database is checked first. If an entry is found in the database. that is used as the response —omitting all checks of S3 itself. This includes tombstone markers, which are used as evidence that a file does not exist. The caller is given the current state of the object (existence/nonexistence, size and type), without S3 being queried at all. This can be significantly faster than interacting with S3.

With this design, directory listings are kept consistent across a sequence of operations, even across multiple servers —indeed, across multiple Hadoop clusters.

What S3Guard Cannot Do

S3Guard only stores filenames, lengths, "etags" and optional version data in its DynamoDB tables. The contents of the files are still stored in S3. If a file is overwritten, S3's eventual consistency may mean that the old data is still present. S3 tries to detect this by using the following techniques:
  • Storing the "etag" checksum of each file and detecting when the value is different from that in S3.
  • In an S3 bucket with "versioning" enabled, the version is used to guarantee that the file loaded is the one matching the S3Guard record.

When a file is loaded, the S3A connector will make repeated attempts to load a file with the expected etag and version ID before concluding that it is not available – at which point the operation will fail.

S3Guard does not provide the "atomic directory rename" operation which many applications rely on to commit their intermediate work to the final destination of the job. This does not work in S3 where files are copied one-by-one in a slow process. If this operation is interrupted, the source and destination will be in an unknown state.

  • For Spark applications, use an S3A committer.
  • When copying files using distcp, avoid the -atomic option. All it will do is make the copy slower. Use the -direct operation for a direct upload, and be aware that a failure during the copy can leave the destination inconsistent.