Cloud Data Access

Committing Output to S3

For the reasons covered in Limitations of Amazon S3, using S3 as the direct destination of work can be slow and unreliable in the presence of failures. Therefore, we recommend that you use HDFS as the destination of work, then use DistCp to copy the results to S3 if you wish to persist them beyond the life of the cluster. HDFS provides the semantics that the output committers used by Spark and Hadoop MapReduce rely on to generate output correctly: atomic directory renames and consistent directory listings.
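
As a sketch of this workflow, assuming a job has already written its results to a hypothetical HDFS directory /results/output and that the cluster's default filesystem is HDFS, a DistCp invocation along these lines copies the data to an S3 bucket (here named mybucket, also hypothetical) through the s3a connector:

  hadoop distcp hdfs:///results/output s3a://mybucket/results/output

Because DistCp runs as a MapReduce job, the copy is parallelized across the cluster, and a failed copy can simply be rerun; the original output remains safe in HDFS until the upload succeeds.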