Configuring and Tuning S3A Fast Upload
Note: These tuning recommendations are experimental and may change in the future.
Because of the nature of the S3 object store, data written to an S3A `OutputStream` is not written incrementally; instead, by default, it is buffered to disk until the stream is closed in its `close()` method. This can make output slow because the execution time for `OutputStream.close()` is proportional to the amount of data buffered and inversely proportional to the bandwidth between the host and S3; that is, O(data/bandwidth). Other work in the same process, server, or network at the time of upload may increase the upload time.
In summary, the further the process is from the S3 endpoint, or the smaller the EC2 VM is, the longer it will take to complete the work. This can create problems in application code:
- Code often assumes that the `close()` call is fast; the delays can create bottlenecks in operations.
- Very slow uploads sometimes cause applications to time out, because threads blocking during the upload stop reporting progress, triggering timeouts.
- Streaming very large amounts of data may consume all disk space before the upload begins.
Enabling S3A Fast Upload
To enable the fast upload mechanism, set the `fs.s3a.fast.upload` property to true (it is disabled by default). When this is set, the incremental block upload mechanism is used, with the buffering mechanism set in `fs.s3a.fast.upload.buffer`. The number of threads performing uploads in the filesystem is defined by `fs.s3a.threads.max`; the queue of waiting uploads is limited by `fs.s3a.max.total.tasks`. The size of each buffer is set by `fs.s3a.multipart.size`.
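For example, the following sketch (not from the original text; the bucket name is a placeholder, and in practice these properties are usually set cluster-wide in core-site.xml) enables fast upload programmatically:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class EnableFastUpload {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setBoolean("fs.s3a.fast.upload", true);    // enable incremental block upload
    conf.set("fs.s3a.fast.upload.buffer", "disk");  // disk (default), array, or bytebuffer
    conf.set("fs.s3a.multipart.size", "100M");      // size of each uploaded block

    // "example-bucket" is a placeholder; credentials must be configured separately.
    try (FileSystem fs = FileSystem.get(URI.create("s3a://example-bucket/"), conf)) {
      System.out.println("S3A filesystem ready: " + fs.getUri());
    }
  }
}
```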
Configuring S3A Fast Upload Options
The following major configuration options are available for the S3A fast upload:
Table 3.2. S3A Fast Upload Configuration Options
Parameter | Default Value | Description |
---|---|---|
`fs.s3a.fast.upload.buffer` | disk | Allowed values are: disk, array, bytebuffer. "disk" buffers data on the local disks listed in `fs.s3a.buffer.dir`; "array" buffers in byte arrays on the JVM heap; "bytebuffer" buffers in off-heap direct ByteBuffers. Both "array" and "bytebuffer" will consume memory in a single stream up to `fs.s3a.multipart.size * fs.s3a.fast.upload.active.blocks` bytes, so keep those values low when using either. The total number of threads performing work across all streams is set by `fs.s3a.threads.max`, with `fs.s3a.max.total.tasks` setting the number of queued work items. |
`fs.s3a.multipart.size` | 100M | Defines the size (in bytes) of the chunks into which the upload or copy operations will be split up. A suffix from the set {K,M,G,T,P} may be used to scale the numeric value. |
`fs.s3a.fast.upload.active.blocks` | 4 | Defines the maximum number of blocks a single output stream can have active uploading, or queued to the central FileSystem instance's pool of queued operations. This stops a single stream overloading the shared thread pool. |
`fs.s3a.buffer.dir` | Empty value | A comma-separated list of temporary directories used for storing blocks of data prior to their being uploaded to S3. When unset (the default), the Hadoop temporary directory `hadoop.tmp.dir` is used. |
Note that:
- If the amount of data written to a stream is below that set in `fs.s3a.multipart.size`, the upload is performed in the `OutputStream.close()` operation, as with the original output stream.
- The published Hadoop metrics monitor includes live queue length and upload operation counts, so you can identify when there is a backlog of work or a mismatch between data generation rates and network bandwidth. Per-stream statistics can also be logged by calling `toString()` on the current stream (see the sketch after this list).
- Incremental writes are not visible; the object can only be listed or read when the multipart operation completes in the `close()` call, which blocks until the upload is completed.
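As an illustration of the statistics note above, the following sketch (placeholder bucket and path, illustrative block size) writes data and logs the wrapped stream's `toString()` output after each write:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LogStreamStats {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setBoolean("fs.s3a.fast.upload", true);
    FileSystem fs = FileSystem.get(URI.create("s3a://example-bucket/"), conf);

    byte[] block = new byte[8 * 1024 * 1024]; // 8 MB of zeroes, purely illustrative
    try (FSDataOutputStream out = fs.create(new Path("/tmp/demo.bin"))) {
      for (int i = 0; i < 16; i++) {
        out.write(block);
        // The S3A output stream reports queue lengths and upload counters in toString().
        System.out.println(out.getWrappedStream());
      }
    } // close() blocks until the multipart upload completes; only then is the object visible
  }
}
```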
Fast Upload with Disk Buffers
This is the default buffer mechanism. The amount of data which can be buffered is limited by the amount of available disk space.
When `fs.s3a.fast.upload.buffer` is set to "disk", all data is buffered to local hard disks prior to upload. This minimizes the amount of memory consumed, and so eliminates heap size as the limiting factor in queued uploads, exactly as the original "direct to disk" buffering used when `fs.s3a.fast.upload=false`.
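A short sketch of disk buffering spread across multiple local disks; the directory paths are assumptions, not values from this document:

```java
import org.apache.hadoop.conf.Configuration;

public final class DiskBufferConfig {
  static Configuration diskBuffering() {
    Configuration conf = new Configuration();
    conf.setBoolean("fs.s3a.fast.upload", true);
    conf.set("fs.s3a.fast.upload.buffer", "disk");
    // Comma-separated local directories for blocks awaiting upload;
    // hadoop.tmp.dir is used when fs.s3a.buffer.dir is left unset.
    conf.set("fs.s3a.buffer.dir", "/data1/s3a,/data2/s3a");
    return conf;
  }
}
```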
Fast Upload with Byte Buffers
When `fs.s3a.fast.upload.buffer` is set to "bytebuffer", all data is buffered in "direct" ByteBuffers prior to upload. This may be faster than buffering to disk when there is little disk space to buffer with (for example, when using tiny EC2 VMs).
The ByteBuffers are created in the memory of the JVM, but not in the Java heap itself. The amount of data which can be buffered is limited by the Java runtime, the operating system, and, for YARN applications, the amount of memory requested for each container.
The slower the upload bandwidth to S3, the greater the risk of running out of memory, and so the more care is needed in tuning the upload thread settings to reduce the maximum amount of data which can be buffered awaiting upload (see below).
Fast Upload with Array Buffers
When `fs.s3a.fast.upload.buffer` is set to "array", all data is buffered in byte arrays in the JVM's heap prior to upload. This may be faster than buffering to disk.
The amount of data which can be buffered is limited by the available size of the JVM heap. The slower the write bandwidth to S3, the greater the risk of heap overflows. This risk can be mitigated by tuning the upload thread settings (see below).
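To make the memory bound for the array and bytebuffer mechanisms concrete, here is a back-of-the-envelope calculation with illustrative numbers (not recommendations from this document):

```java
public final class BufferBudget {
  public static void main(String[] args) {
    long multipartSize = 100L << 20; // fs.s3a.multipart.size = 100M
    int activeBlocks = 4;            // fs.s3a.fast.upload.active.blocks
    int openStreams = 8;             // concurrent output streams in this process

    // Worst case: every stream has its full quota of blocks buffered in memory.
    long perStream = multipartSize * activeBlocks;
    long processTotal = perStream * openStreams;
    System.out.printf("per stream: %d MB, process total: %d MB%n",
        perStream >> 20, processTotal >> 20);
    // Prints: per stream: 400 MB, process total: 3200 MB
  }
}
```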
Thread Tuning for S3A Fast Upload
Both the array and bytebuffer buffer mechanisms can consume very large amounts of memory, on-heap or off-heap respectively. The disk buffer mechanism does not use much memory, but it consumes hard disk capacity.
If there are many output streams being written to in a single process, the memory or disk consumed is the per-stream usage multiplied by the number of active streams.
You may need to perform careful tuning to reduce the risk of running out of memory, especially if the data is buffered in memory. There are a number of parameters which can be tuned:
- The total number of threads available in the filesystem for data uploads or any other queued filesystem operation. This is set in `fs.s3a.threads.max`.
- The number of operations which can be queued for execution, awaiting a thread. This is set in `fs.s3a.max.total.tasks`.
- The number of blocks which a single output stream can have active (that is, being uploaded by a thread or queued in the filesystem thread queue). This is set in `fs.s3a.fast.upload.active.blocks`.
- The length of time that an idle thread can stay in the thread pool before it is retired. This is set in `fs.s3a.threads.keepalivetime`.
Table 3.3. S3A Fast Upload Tuning Options
Parameter | Default Value | Description |
---|---|---|
`fs.s3a.fast.upload.active.blocks` | 4 | Maximum number of blocks a single output stream can have active (uploading, or queued to the central FileSystem instance's pool of queued operations). This stops a single stream overloading the shared thread pool. |
`fs.s3a.threads.max` | 10 | The total number of threads available in the filesystem for data uploads or any other queued filesystem operation. |
`fs.s3a.max.total.tasks` | 5 | The number of operations which can be queued for execution, awaiting a thread. |
`fs.s3a.threads.keepalivetime` | 60 | The number of seconds a thread can be idle before being terminated. |
When the maximum allowed number of active blocks of a single stream is reached, no more blocks can be uploaded from that stream until one or more of those active block uploads completes. That is, a `write()` call which would trigger an upload of a now-full data block will instead block until there is capacity in the queue.
Consider the following:
- As the pool of threads set in `fs.s3a.threads.max` is shared (and intended to be used across all threads), a larger number here can allow for more parallel operations. However, as uploads require network bandwidth, adding more threads does not guarantee speedup.
- The extra queue of tasks for the thread pool (`fs.s3a.max.total.tasks`) covers all ongoing background S3A operations.
- When using memory buffering, a small value of `fs.s3a.fast.upload.active.blocks` limits the amount of memory which can be consumed per stream.
- When using disk buffering, a larger value of `fs.s3a.fast.upload.active.blocks` does not consume much memory, but it may result in a large number of blocks competing with other filesystem operations.
We recommend a low value of `fs.s3a.fast.upload.active.blocks`: enough to start background upload without overloading other parts of the system. Then experiment to see if higher values deliver more throughput, especially from VMs running on EC2.
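As a starting point for such experiments, the following sketch (illustrative values, assuming memory buffering) keeps the per-stream block limit low while leaving the shared pool at its documented defaults:

```java
import org.apache.hadoop.conf.Configuration;

public final class UploadTuning {
  static Configuration tunedForMemoryBuffering() {
    Configuration conf = new Configuration();
    conf.setBoolean("fs.s3a.fast.upload", true);
    conf.set("fs.s3a.fast.upload.buffer", "bytebuffer");
    conf.setInt("fs.s3a.fast.upload.active.blocks", 2); // low per-stream memory cap
    conf.setInt("fs.s3a.threads.max", 10);              // shared upload thread pool
    conf.setInt("fs.s3a.max.total.tasks", 5);           // queued work items
    conf.setInt("fs.s3a.threads.keepalivetime", 60);    // idle thread lifetime (seconds)
    return conf;
  }
}
```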