Accessing Cloud Data

Hive and S3: The Need for S3Guard

Hive's listing of source files and its rename operations both need a consistent view of the filesystem's metadata: a directory listing must return every file that exists underneath the directory, and none that have been deleted. HDFS, Azure and ADL all meet this requirement. Amazon S3 does not directly meet it.
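To see why a complete listing matters, consider a rename that works by listing the source directory and then copying and deleting each listed file. The following is a toy model only, not S3, Hive, or S3Guard code: the `ToyStore` class, its methods, and the paths are hypothetical, and the listing lag is modelled as a deterministic worst case in which newly written files never appear in a listing until `settle()` is called.

```python
"""Toy model of a listing-driven rename; not S3, Hive, or S3Guard code."""


class ToyStore:
    """Hypothetical object store whose listing may lag behind recent writes."""

    def __init__(self, consistent):
        self.consistent = consistent
        self.objects = {}    # path -> data: what actually exists
        self.listed = set()  # paths the listing index currently knows about

    def put(self, path, data):
        self.objects[path] = data
        if self.consistent:
            # A consistent store reflects the write in listings at once.
            self.listed.add(path)

    def delete(self, path):
        del self.objects[path]
        self.listed.discard(path)

    def settle(self):
        """Simulate the listing index catching up with reality."""
        self.listed = set(self.objects)

    def list(self, prefix):
        return sorted(p for p in self.listed if p.startswith(prefix))

    def rename_dir(self, src, dst):
        """Rename by listing src, then copying and deleting each listed file."""
        for path in self.list(src):
            self.put(dst + path[len(src):], self.objects[path])
            self.delete(path)


def commit_job(store):
    """Write two output files, rename them into place, return the final listing."""
    store.put("tmp/job/part-0000", "a")
    store.put("tmp/job/part-0001", "b")
    store.rename_dir("tmp/job/", "warehouse/t/")
    store.settle()
    return store.list("warehouse/t/")
```

With `ToyStore(consistent=True)`, `commit_job` returns both files under `warehouse/t/`. With `ToyStore(consistent=False)`, the rename sees an empty listing, copies nothing, and the destination stays empty even after the listing settles: the job appears to succeed but its output is missing.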

To safely use S3 as a destination of Hive jobs, S3Guard must be enabled for the destination bucket.
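As a rough sketch, S3Guard is typically enabled through S3A properties in core-site.xml, and per-bucket options let it apply to the destination bucket only. The fragment below assumes the DynamoDB-backed metadata store; the bucket name `hive-dest`, the table name `hive-dest-s3guard`, and the region are hypothetical placeholders, and the property names should be checked against the S3Guard documentation for the Hadoop version in use.

```xml
<!-- Sketch only: bucket, table, and region values are placeholders. -->
<property>
  <!-- Use the DynamoDB-backed S3Guard metadata store for bucket "hive-dest". -->
  <name>fs.s3a.bucket.hive-dest.metadatastore.impl</name>
  <value>org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore</value>
</property>
<property>
  <!-- DynamoDB table holding the metadata for this bucket (assumed name). -->
  <name>fs.s3a.bucket.hive-dest.s3guard.ddb.table</name>
  <value>hive-dest-s3guard</value>
</property>
<property>
  <!-- Region of the DynamoDB table (assumed value). -->
  <name>fs.s3a.bucket.hive-dest.s3guard.ddb.region</name>
  <value>us-west-2</value>
</property>
```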

It may be possible to use unguarded S3 buckets as a source of data, provided the source data has been unchanged “long enough” for the object metadata to become consistent across all S3 servers. There is, however, no general consensus on what duration constitutes “long enough”, especially in the presence of failures. Data that never changes is, of course, not at risk.