Troubleshooting the S3A Committers
Refer to the Apache documentation for troubleshooting the committers.
The most common problem to surface is that a program does not actually switch to the new committers. There are three common reasons for this:

1. The configuration settings to switch to the new committer are not being picked up. This is particularly common in Spark, which is harder to set up (see the configuration sketch after this list).
2. The program is using the older V1 MapReduce APIs. Fix: switch to the V2 API (see the API sketch after this list).
3. The output format the program uses explicitly creates its own committer. This can only be fixed by modifying the program.
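As an illustration of the first case, here is a sketch of the Spark settings which select an S3A committer. It assumes the spark-hadoop-cloud module is on the classpath, and the committer name ("directory" here) is only one of the available choices, so treat it as an illustration rather than a complete recipe:

    spark.hadoop.fs.s3a.committer.name directory
    spark.sql.sources.commitProtocolClass org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
    spark.sql.parquet.output.committer.class org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter

If any of these settings is missing or mistyped, Spark falls back to the classic file committer without complaint, which is exactly the failure mode described above.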
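For the second case, here is a minimal Scala sketch of the V1-to-V2 switch as it appears in Spark RDD code; the bucket, output path and record types are hypothetical. The point is that saveAsHadoopFile() drives the V1 org.apache.hadoop.mapred API, which always hands back the classic committer, while saveAsNewAPIHadoopFile() drives the V2 org.apache.hadoop.mapreduce API, where the factory mechanism that binds the S3A committers lives:

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.output.{TextOutputFormat => V2TextOutputFormat}
    import org.apache.spark.sql.SparkSession

    object CommitterApiSwitch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("committer-api-switch").getOrCreate()
        val records = spark.sparkContext
          .parallelize(Seq("first", "second"))
          .map(s => (new LongWritable(s.length), new Text(s)))

        // V1: records.saveAsHadoopFile(...) uses org.apache.hadoop.mapred,
        // whose FileOutputFormat never consults the committer factory,
        // so the classic FileOutputCommitter is always used.

        // V2: org.apache.hadoop.mapreduce.lib.output.FileOutputFormat goes
        // through the committer factory, so a configured S3A committer
        // can be picked up.
        records.saveAsNewAPIHadoopFile("s3a://example-bucket/v2-output",
          classOf[LongWritable], classOf[Text],
          classOf[V2TextOutputFormat[LongWritable, Text]])

        spark.stop()
      }
    }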
To help debug Spark's configuration, there is an option which can be set to forcibly fail the Spark query if the path output committer is requested but, for whatever reason, the file committer is returned instead.
    spark.hadoop.pathoutputcommit.reject.fileoutput true
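For example, when submitting with spark-submit, the option can be passed on the command line instead of being set in spark-defaults.conf (the application jar, job.jar, is hypothetical):

    spark-submit \
      --conf spark.hadoop.pathoutputcommit.reject.fileoutput=true \
      job.jar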
There is also the blunt-instrument approach of causing the original output committer to crash with an invalid configuration.
    spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 3
Because only algorithm versions 1 and 2 are valid, this setting ensures that if the original file committer is ever used, it will fail with an exception.
To enable low-level logging of the committers, set the log level of the package org.apache.hadoop.fs.s3a.commit to DEBUG.
With Log4J, this can be done in log4j.properties:
    log4j.logger.org.apache.hadoop.fs.s3a.commit=DEBUG
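If the deployment uses Log4j 2 instead (as recent Spark releases do), the equivalent entry in log4j2.properties should look like the following; the logger id "s3acommit" is an arbitrary label:

    logger.s3acommit.name = org.apache.hadoop.fs.s3a.commit
    logger.s3acommit.level = debug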