Configuring Directories for Intermediate Data
In addition to fs.s3a.committer.name
, two other
core-site.xml
configuration options are used to control where intermediate is
stored.
A location is in the local filesystem for buffering data
<property> <name>fs.s3a.buffer.dir</name> <value>${hadoop.tmp.dir}/s3a</value> <description>Comma separated list of directories that will be used to buffer file uploads to.</description> </property>
These directories will store the output created by all active tasks until each task is committed; the more worker processes/spark worker threads a host can support, the more disk space will be needed. Multiple disks can be listed to help spread the load, and recover from disk failure.
A location in the cluster's HDFS filesystem to share summary data about pending uploads.
<property> <name>fs.s3a.committer.staging.tmp.path</name> <value>tmp/staging</value> </property>
These files are generally quite small: a few kilobytes per task.