Limitations of the S3A Committers
There are limitations of the S3A committers associated with custom file output formats, MapReduce API output format, and non-availability of Hive support.
Custom file output formats and their committers
Output formats which implement their own committers do not automatically switch to the new committers. If such a custom committer relies on renaming files to commit output, then it will depend on S3Guard for a consistent view of the object store, and take time to commit output proportional to the amount of data and the number of files.
To determine if this is the case, find the subclass of
org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
which implements
the custom format, to see if it subclasses the getOutputCommitter()
to
return its own committer, or has a custom subclass
oforg.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
.
It may be possible to migrate such a committer to support store-specific committers, as was
done for Apache Parquet support in Spark. Here a subclass of Parquet's
ParquetOutputCommitter
was implemented to delegates all operations to the
real committer.
MapReduce V1 API Output Format and Committers
Only the MapReduce V2 APIs underorg.apache.hadoop.mapreduce
support the
new committer binding mechanism. The V1 APIs under the
org.apache.hadoop.mapred
package only bind to the file committer and
subclasses. The v1 APIs date from Hadoop 1.0 and should be considered obsolete. Please
migrate to the v2 APIs, not just for the new committers, but because the V2 APIs are still
being actively developed and maintained.
No Hive Support
There is currently no Hive support for the S3A committers. To safely use S3 as a destination of Hive work, you must use S3Guard.