Introduction to S3Guard
Amazon S3 is an object store, not a filesystem. There are no directories, only objects. The S3A connector lets Hadoop, Hive and Spark applications see files in a directory tree, but really they are working on the objects underneath, by listing them and working on each one one-by-one.
Some of the operations which filesystems support are actually absent, with rename being the key one. The S3A connector mimics file or directory rename, by copying each file then deleting the original, which takes about 6-10 megabytes/second.
The S3 Object Store is "eventually consistent": when a file is deleted or overwritten it can take time for that change to propagate across all servers replicating the data. As a result, newly deleted files can still be visible, while queries of updated files can return the old version.
How long is "eventually"? There's no official number; the paper "Benchmarking Eventual Consistency" has shown it can vary by time of day, and be ten seconds or more -sometimes much more.
A critical problem is listing inconsistency: when a query is made of S3 to list all objects under a specific path, that listing can be out of date. This means that those operation on files under a "directory" mimicked by listing and acting on all objects underneath it are at risk of not seeing the complete list of files. Newly created files are at most risk.
This may affect the following operations on S3 data:
When listing files, newly created objects may not be listed immediately and deleted objects may continue to be listed — which means that your input for data processing may be incorrect. In Hive, Spark, or MapReduce, this could lead to erroneous results. In the worst case, it could potentially lead to data loss at the time of data movement.
When renaming directories, the listing may be incomplete or out of date, so the rename operation loses files. This is very dangerous as MapReduce, Hive, Spark and Tez all rely on rename to commit the output of workers to the final output of the job. If data is lost, the output is incorrect —something which may not be immediately obvious.
When deleting directories, the listing may be inconsistent, so not all objects are deleted. If another job writes data to the same directory path, the old data may still be present.
During an ETL workflow, in a sequence of multiple jobs that form the workflow, the next job is launched soon after the previous job has been completed. Applications such as Oozie rely on marker files to trigger the subsequent workflows. Any delay in the visibility of these files can lead to delays in the subsequent workflows.
During existence-guarded path operations, if a deleted file which has the same name as a target path appears in a listing, some actions may unexpectedly fail due to the target path being present — even though the file has already been deleted.
This eventually consistent behavior of S3 can cause seemingly unpredictable results from queries made against it, limiting the practical utility of the S3A connector for use cases where data gets modified.
S3Guard mitigates the issues related to S3's eventual consistency on listings by using a table on Amazon DynamoDB as a consistent metadata store. This guarantees a consistent view of data stored in S3. In addition, S3Guard may improve query performance by reducing the number of times S3 needs to be contacted, —as DynamoDB is significantly faster.
How S3Guard Works
S3Guard is a feature in the Hadoop S3A connector which uses Amazon's DynamoDB to cache information about created and deleted files, "The S3Guard Database".
When an application using the S3A Connector with S3Guard enabled manipulates objects in S3, such as creating, deleting or "renaming" them, the S3Guard database is updated. Newly created/copied files are added to the table, while deleted files have "tombstone markers" added to indicate that they have been deleted.
When a directory is listed on S3, the information from the S3Guard database is used to update the listing with this information: new files are added while those files with tombstone markers are not included in a listing.
As a result, the listing is up to date with all operations performed on the directory by all clients using S3Guard.
When a file is opened for reading, or any existence check made, the S3Guard database is checked first. If an entry is found in the database. that is used as the response —omitting all checks of S3 itself. This includes tombstone markers, which are used as evidence that a file does not exist. The caller is given the current state of the object (existence/nonexistence, size and type), without S3 being queried at all. This can be significantly faster than interacting with S3.
With this design, directory listings are kept consistent across a sequence of operations, even across multiple servers —indeed, across multiple Hadoop clusters.
What S3Guard Cannot Do
Guarantee access to updated data. While it keeps the listing consistent, attempts to read newly updated data may still return the old value. The only way to guarantee that newly-created data is immediately visible is to use different filenames when writing the new data.
Mimic the "directory rename is a single atomic transaction" behavior of a filesystem like HDFS. Directory renames are still slow and visible while in progress. This means that if the operations fail partway through, the source and destination paths may contain a mix (including some duplicate) copies of data files.