Working with Google Cloud Storage
The manifest committer is compatible with and tested against Google Cloud Storage
through the gcs-connector library from Google, which provides a Hadoop filesystem client for the
schema gs
.
Google Cloud Storage has the semantics needed for the commit protocol to work safely.
The Spark settings to switch to this committer are:
spark.hadoop.mapreduce.outputcommitter.factory.scheme.gs org.apache.hadoop.mapreduce.lib.output.committer.manifest.ManifestCommitterFactory spark.sql.parquet.output.committer.class org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter spark.sql.sources.commitProtocolClass org.apache.spark.internal.io.cloud.PathOutputCommitProtocol spark.hadoop.mapreduce.manifest.committer.summary.report.directory (optional: URI of a directory for job summaries)
The store's directory delete operations are O(files)
so the value of
mapreduce.manifest.committer.cleanup.parallel.delete
should be left at the
default of true
.
For MapReduce, declare the binding in core-site.xml
or
mapred-site.xml
:
<property> <name>mapreduce.outputcommitter.factory.scheme.gcs</name> <value>org.apache.hadoop.mapreduce.lib.output.committer.manifest.ManifestCommitterFactory</value> </property>