Limitations of the S3A Committers
Custom file output formats and their committers
Output formats which implement their own committers do not automatically switch to the new committers. If such a custom committer relies on renaming files to commit output, then it will depend on S3Guard for a consistent view of the object store, and take time to commit output proportional to the amount of data and the number of files.
To determine if this is the case, find the subclass of org.apache.hadoop.mapreduce.lib.output.FileOutputFormat which implements the custom format, and see whether it overrides getOutputCommitter() to return its own committer, or has a custom subclass of org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.
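The check described above can also be approximated with reflection. The following is a minimal sketch, not part of Hadoop: the classes BaseFormat, PlainFormat and CustomFormat are hypothetical stand-ins so that the example compiles without Hadoop on the classpath, and their getOutputCommitter() takes no arguments, unlike the real method. The helper overridesMethod() is likewise an illustrative name.

```java
import java.lang.reflect.Method;

public class CommitterCheck {

    // Hypothetical stand-in for FileOutputFormat (simplified signature).
    static class BaseFormat {
        public Object getOutputCommitter() { return "default"; }
    }

    // A format which inherits the default committer.
    static class PlainFormat extends BaseFormat { }

    // A format which returns its own committer.
    static class CustomFormat extends BaseFormat {
        @Override
        public Object getOutputCommitter() { return "custom"; }
    }

    /**
     * Returns true if clazz, or any superclass strictly below base,
     * declares a method with the given name. Matching is by name only,
     * so overloads are treated as overrides; this is a deliberate
     * simplification.
     */
    static boolean overridesMethod(Class<?> clazz, Class<?> base, String methodName) {
        for (Class<?> c = clazz; c != null && c != base; c = c.getSuperclass()) {
            for (Method m : c.getDeclaredMethods()) {
                if (m.getName().equals(methodName)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(overridesMethod(PlainFormat.class, BaseFormat.class, "getOutputCommitter"));
        System.out.println(overridesMethod(CustomFormat.class, BaseFormat.class, "getOutputCommitter"));
    }
}
```

Against a real job, the same helper could be called with the custom format's class and FileOutputFormat.class as the base, assuming both are on the classpath. A result of true only indicates that the format supplies its own committer; whether that committer relies on renames still has to be checked in its source.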
It may be possible to migrate such a committer to support store-specific committers, as was done for Apache Parquet support in Spark. Here a subclass of Parquet's ParquetOutputCommitter was implemented to delegate all operations to the real committer.
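The delegation approach can be sketched as follows. This is an illustration of the pattern only, not the Spark or Parquet code: the Committer interface, FormatCommitter, RecordingCommitter and BindingCommitter are all hypothetical, heavily simplified stand-ins, and a real committer has more operations (task setup, abort, recovery) which would need the same treatment.

```java
import java.util.ArrayList;
import java.util.List;

public class DelegatingCommitterSketch {

    // Hypothetical, simplified committer API.
    interface Committer {
        void setupJob();
        void commitTask(String taskId);
        void commitJob();
    }

    // Stand-in for the format's own rename-based committer. In this sketch
    // its operations throw, so any call which is not delegated is noticed.
    static class FormatCommitter implements Committer {
        public void setupJob() { throw new IllegalStateException("rename-based setupJob"); }
        public void commitTask(String taskId) { throw new IllegalStateException("rename-based commitTask"); }
        public void commitJob() { throw new IllegalStateException("rename-based commitJob"); }
    }

    // Stand-in for the real, store-specific committer; it records each call.
    static class RecordingCommitter implements Committer {
        final List<String> calls = new ArrayList<>();
        public void setupJob() { calls.add("setupJob"); }
        public void commitTask(String taskId) { calls.add("commitTask:" + taskId); }
        public void commitJob() { calls.add("commitJob"); }
    }

    // Subclass of the format's committer which forwards every operation
    // to the real committer instead of running the inherited logic.
    static class BindingCommitter extends FormatCommitter {
        private final Committer real;

        BindingCommitter(Committer real) { this.real = real; }

        @Override public void setupJob() { real.setupJob(); }
        @Override public void commitTask(String taskId) { real.commitTask(taskId); }
        @Override public void commitJob() { real.commitJob(); }
    }

    public static void main(String[] args) {
        RecordingCommitter real = new RecordingCommitter();
        Committer committer = new BindingCommitter(real);
        committer.setupJob();
        committer.commitTask("task_0001");
        committer.commitJob();
        System.out.println(real.calls);
    }
}
```

Because the binding class is still a subclass of the format's committer, code which insists on that type continues to accept it, while the commit work itself is done by whichever committer is supplied.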
MapReduce V1 API Output Format and Committers
Only the MapReduce V2 APIs under org.apache.hadoop.mapreduce support the new committer binding mechanism. The V1 APIs under the org.apache.hadoop.mapred package only bind to the file committer and its subclasses. The V1 APIs date from Hadoop 1.0 and should be considered obsolete. Please migrate to the V2 APIs, not just for the new committers, but because the V2 APIs are still being actively developed and maintained.
No Hive Support
There is currently no Hive support for the S3A committers. To safely use S3 as a destination of Hive work, you must use S3Guard.