Job summaries in _SUCCESS files

The manifest committer writes job summaries into _SUCCESS files, which you can view and collect.

_SUCCESS files

The original Hadoop committer creates a zero-byte _SUCCESS file in the root of the output directory, unless this is disabled.

The manifest committer writes a JSON summary which includes:

  • The name of the committer.
  • Diagnostics information.
  • A list of some of the files created (for testing; a full list is excluded as it can get big).
  • IO Statistics.

If, after running a query, this _SUCCESS file is zero bytes long, the manifest committer has not been used.

If it is not empty, it contains the JSON summary and can be examined.
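The check described above can be sketched in a few lines of Python. This is only an illustration, assuming the _SUCCESS file has first been copied to a local path; the function name inspect_success_marker is hypothetical and not part of any Hadoop tooling. It makes no assumption about the field names inside the summary and only reports the top-level keys it finds.

```python
import json
import os


def inspect_success_marker(path):
    """Classify a local copy of a _SUCCESS file.

    Returns ("empty", None) for a zero-byte marker, which suggests the
    manifest committer was not used, or ("summary", data) when the file
    holds a JSON document.
    """
    if os.path.getsize(path) == 0:
        return "empty", None
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    return "summary", data


def describe(path):
    """Build a one-line description of the marker for quick inspection."""
    kind, data = inspect_success_marker(path)
    if kind == "empty":
        return f"{path}: zero bytes, no summary available"
    # Only list the top-level keys; the exact schema is not assumed here.
    keys = ", ".join(sorted(data)) if isinstance(data, dict) else type(data).__name__
    return f"{path}: JSON summary with top-level entries: {keys}"
```

Calling describe() on the local copy returns a single line stating whether a summary is present.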

Viewing _SUCCESS files through the ManifestPrinter tool

The summary files are JSON, and can be viewed in any text editor.

For a more succinct summary, including better display of statistics, use the ManifestPrinter tool.

hadoop org.apache.hadoop.mapreduce.lib.output.committer.manifest.files.ManifestPrinter <path>

This works for the files saved at the base of an output directory, and any reports saved to a report directory.

Collecting Job Summaries

The committer can be configured to save the _SUCCESS summary files to a report directory, irrespective of whether the job succeeded or failed, by setting a filesystem path in the option mapreduce.manifest.committer.summary.report.directory.

The path does not have to be on the same store/filesystem as the destination of work. For example, a local filesystem could be used.

XML:


<property>
  <name>mapreduce.manifest.committer.summary.report.directory</name>
  <value>file:///tmp/reports</value>
</property>

spark-defaults.conf:

spark.hadoop.mapreduce.manifest.committer.summary.report.directory file:///tmp/reports

This allows for the statistics of jobs to be collected irrespective of their outcome, whether or not saving the _SUCCESS marker is enabled, and without problems caused by a chain of queries overwriting the markers.
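Once reports accumulate in the report directory, they can be loaded for later analysis. The following is a minimal sketch, assuming the report directory is on the local filesystem (as with the file:///tmp/reports value used above) and that each report is a single JSON document. The function name load_reports is hypothetical, and no assumption is made about how the report files are named: anything that does not parse as JSON is skipped.

```python
import json
from pathlib import Path


def load_reports(report_dir):
    """Load every JSON report found directly under report_dir.

    Returns a dict mapping file name to the parsed JSON content.
    Files that are empty, binary, or otherwise not valid JSON are skipped.
    """
    reports = {}
    for entry in sorted(Path(report_dir).iterdir()):
        if not entry.is_file():
            continue
        try:
            with entry.open("r", encoding="utf-8") as f:
                reports[entry.name] = json.load(f)
        except (json.JSONDecodeError, UnicodeDecodeError):
            # Not a JSON report (for example an empty or binary file).
            continue
    return reports
```

For example, load_reports("/tmp/reports") would return one entry per readable report, which can then be compared across jobs.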