Configuring Directories for Intermediate Data

Two other core-site.xml configuration options control where intermediate data is stored.

A location in the local filesystem for buffering data:

  <description>Comma separated list of directories that will be used to buffer file
  uploads to.</description>

These directories store the output created by all active tasks until each task is committed; the more worker processes or Spark worker threads a host can support, the more disk space is needed. Multiple disks can be listed to help spread the load and to recover from disk failure.
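
As a rough illustration, a buffer directory setting spread across two disks might look like the sketch below. The property name `fs.s3a.buffer.dir` is an assumption: the text above does not name the option, but this is the S3A connector option that carries the same description. The two paths are hypothetical and should be replaced with real local mount points.

```xml
<!-- Sketch only. Property name assumed to be fs.s3a.buffer.dir; -->
<!-- the directory paths below are hypothetical examples. -->
<property>
  <name>fs.s3a.buffer.dir</name>
  <!-- One directory per physical disk, comma separated -->
  <value>/data/1/s3a,/data/2/s3a</value>
</property>
```

Listing one directory per physical disk is what allows the load to be spread; several directories on the same disk would add no capacity or resilience.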

A location in the cluster's HDFS filesystem for sharing summary data about pending uploads:


These files are generally quite small: a few kilobytes per task.
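
A minimal sketch of the cluster-filesystem setting follows. The property name `fs.s3a.committer.staging.tmp.path` and the value `tmp/staging` are assumptions drawn from the S3A committer defaults, not from the text above, which does not name the option.

```xml
<!-- Sketch only. Property name and value are assumed from the S3A -->
<!-- committer defaults; adjust to match the cluster's configuration. -->
<property>
  <name>fs.s3a.committer.staging.tmp.path</name>
  <!-- A path in the cluster filesystem (HDFS), not on local disk -->
  <value>tmp/staging</value>
</property>
```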