Working with Google Cloud Storage

The manifest committer is compatible with and tested against Google Cloud Storage through the gcs-connector library from Google, which provides a Hadoop filesystem client for the URI scheme gs.

Google Cloud Storage has the semantics needed for the commit protocol to work safely.

The Spark settings to switch to this committer bind the committer factory for the gs scheme to org.apache.hadoop.mapreduce.lib.output.committer.manifest.ManifestCommitterFactory, together with spark.sql.sources.commitProtocolClass. An optional setting takes the URI of a directory for job summaries.
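
As a rough illustration, the Spark settings could look like the spark-defaults.conf fragment below. The property keys and the two org.apache.spark.internal.io.cloud classes are not spelled out in the text above; they are assumptions based on the usual Hadoop committer-factory naming and on Spark's spark-hadoop-cloud module, and should be checked against the Hadoop and Spark versions in use. The bucket name is hypothetical.

```properties
# Bind the manifest committer factory to the gs scheme
# (key name assumed from the per-scheme committer factory convention).
spark.hadoop.mapreduce.outputcommitter.factory.scheme.gs org.apache.hadoop.mapreduce.lib.output.committer.manifest.ManifestCommitterFactory

# Spark-side binding classes (assumed: provided by the spark-hadoop-cloud module).
spark.sql.parquet.output.committer.class org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter
spark.sql.sources.commitProtocolClass org.apache.spark.internal.io.cloud.PathOutputCommitProtocol

# Optional: URI of a directory for job summaries (hypothetical bucket and path).
spark.hadoop.mapreduce.manifest.committer.summary.report.directory gs://example-bucket/committer-reports/
```

The spark.hadoop. prefix is how Spark passes a property through to the Hadoop configuration, which is why the committer's own options appear with that prefix here.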

The store's directory delete operations are O(files) so the value of mapreduce.manifest.committer.cleanup.parallel.delete should be left at the default of true.

For MapReduce, declare the binding in core-site.xml or mapred-site.xml: