Job summaries in _SUCCESS files
You can view and collect job summaries in the _SUCCESS files.
_SUCCESS files
The original Hadoop committer creates a zero-byte _SUCCESS file in the root of the output directory unless disabled.
The manifest committer writes a JSON summary which includes:
- The name of the committer.
- Diagnostics information.
- A list of some of the files created (for testing; a full list is excluded as it can get big).
- IO Statistics.
If, after running a query, this _SUCCESS file is zero bytes long, the manifest committer has not been used.
If it is not empty, it contains the JSON summary and can be examined.
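The zero-byte check described above can be scripted. The following is a minimal sketch, assuming the output directory is reachable as a local path (for an object store the file would first need to be downloaded). The function name `inspect_success_marker` is hypothetical, and no particular JSON field names are assumed: the code only reports which top-level keys are present.

```python
import json
from pathlib import Path


def inspect_success_marker(output_dir):
    """Classify the _SUCCESS marker at the root of a local output directory.

    Returns a (kind, keys) tuple where kind is one of:
      "missing"  - no _SUCCESS file exists
      "empty"    - zero-byte marker, so the manifest committer was not used
      "summary"  - non-empty marker that parsed as JSON
    and keys is the sorted list of top-level JSON field names (empty otherwise).
    """
    marker = Path(output_dir) / "_SUCCESS"
    if not marker.is_file():
        return "missing", []
    if marker.stat().st_size == 0:
        return "empty", []
    with marker.open(encoding="utf-8") as f:
        summary = json.load(f)
    # Field names are not assumed here; only report what is actually present.
    keys = sorted(summary) if isinstance(summary, dict) else []
    return "summary", keys
```

Calling `inspect_success_marker("/path/to/output")` on a directory written by the classic committer would return `("empty", [])`, while a directory written by the manifest committer would return `"summary"` together with the field names found in the file.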
Viewing _SUCCESS files through the ManifestPrinter tool
The summary files are JSON, and can be viewed in any text editor.
For a more succinct summary, including better display of statistics, use the ManifestPrinter tool.
hadoop org.apache.hadoop.mapreduce.lib.output.committer.manifest.files.ManifestPrinter <path>
This works for the files saved at the base of an output directory, and any reports saved to a report directory.
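When many summaries need printing, the command above can be wrapped in a small helper. This is a sketch only: the helper names `manifest_printer_command` and `print_manifest` are hypothetical, the class name is taken from the command shown above, and the code assumes a `hadoop` launcher is on the `PATH`.

```python
import shutil
import subprocess

# Fully qualified class name, as used in the hadoop command line above.
MANIFEST_PRINTER = (
    "org.apache.hadoop.mapreduce.lib.output.committer.manifest.files.ManifestPrinter"
)


def manifest_printer_command(summary_path):
    """Build the argument list for printing one summary file."""
    return ["hadoop", MANIFEST_PRINTER, summary_path]


def print_manifest(summary_path):
    """Run ManifestPrinter on a summary file and return its text output.

    Returns None when no hadoop launcher is found on the PATH (assumption:
    the launcher is named "hadoop", as in the command line above).
    """
    if shutil.which("hadoop") is None:
        return None
    result = subprocess.run(
        manifest_printer_command(summary_path),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the tool exits non-zero
    )
    return result.stdout
```

The path passed in can be a `_SUCCESS` file at the base of an output directory or a report saved to a report directory, for example `print_manifest("file:///tmp/reports/<report file>")`, where the report file name is a placeholder.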
Collecting Job Summaries
The committer can be configured to save the _SUCCESS summary files to a report directory, irrespective of whether the job succeeded or failed, by setting a filesystem path in the option mapreduce.manifest.committer.summary.report.directory.
The path does not have to be on the same store/filesystem as the destination of work. For example, a local filesystem could be used.
XML:
<property>
<name>mapreduce.manifest.committer.summary.report.directory</name>
<value>file:///tmp/reports</value>
</property>
spark-defaults.conf:
spark.hadoop.mapreduce.manifest.committer.summary.report.directory file:///tmp/reports
This allows for the statistics of jobs to be collected irrespective of their outcome, whether or not saving the _SUCCESS marker is enabled, and without problems caused by a chain of queries overwriting the markers.
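Once reports accumulate in a local report directory such as the `/tmp/reports` used in the examples above, they can be loaded for analysis with a short script. This is a minimal sketch: the function name `load_job_summaries` is hypothetical, and it assumes the saved reports are JSON files with a `.json` extension, which should be checked against the files actually written.

```python
import json
from pathlib import Path


def load_job_summaries(report_dir):
    """Load every JSON summary found directly inside a local report directory.

    Returns a dict mapping file name to the parsed JSON document.
    Files that cannot be parsed are skipped rather than aborting the scan.
    """
    summaries = {}
    # Assumption: reports are saved with a .json extension.
    for path in sorted(Path(report_dir).glob("*.json")):
        try:
            with path.open(encoding="utf-8") as f:
                summaries[path.name] = json.load(f)
        except (json.JSONDecodeError, UnicodeDecodeError):
            continue  # ignore partial or unrelated files
    return summaries
```

Because each job writes its own report, a chain of queries produces one entry per job in the returned dict, unlike the single _SUCCESS marker in an output directory that later queries overwrite.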