Configuring Directories for Intermediate Data

In addition to fs.s3a.committer.name, two other core-site.xml configuration options control where intermediate data is stored.

The first is a location in the local filesystem for buffering data:

<property>
  <name>fs.s3a.buffer.dir</name>
  <value>${hadoop.tmp.dir}/s3a</value>
  <description>Comma separated list of directories that will be used to buffer file
  uploads to.</description>
</property> 

These directories store the output created by all active tasks until each task is committed, so the more worker processes or Spark worker threads a host can support, the more disk space is needed. Directories on multiple disks can be listed to help spread the load and to recover from disk failure.
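
As a minimal sketch of listing multiple disks, the override below spreads buffering across two local directories. The paths /mnt/disk1/s3a and /mnt/disk2/s3a are hypothetical mount points chosen purely for illustration; substitute directories that actually exist on the worker hosts.

<property>
  <name>fs.s3a.buffer.dir</name>
  <!-- Hypothetical example paths, one per physical disk; not defaults -->
  <value>/mnt/disk1/s3a,/mnt/disk2/s3a</value>
</property>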

The second is a location in the cluster's HDFS filesystem for sharing summary data about pending uploads:

<property>
  <name>fs.s3a.committer.staging.tmp.path</name>
  <value>tmp/staging</value>
</property>

These files are generally quite small: a few kilobytes per task.