Configuring and Tuning S3A Fast Upload
Note: These tuning recommendations are experimental and may change in the future.
Because of the nature of the S3 object store, data written to an S3A `OutputStream` is not written incrementally; instead, by default, it is buffered to disk until the stream is closed in its `close()` method. This can make output slow because the execution time for `OutputStream.close()` is proportional to the amount of data buffered and inversely proportional to the bandwidth between the host and S3; that is, O(data/bandwidth). Other work in the same process, server, or network at the time of upload may increase the upload time.
In summary, the further the process is from the S3 endpoint, or the smaller the EC2 VM is, the longer it will take to complete the work. This can create problems in application code:
- Code often assumes that the `close()` call is fast; the delays can create bottlenecks in operations.
- Very slow uploads sometimes cause applications to time out, because threads blocking during the upload stop reporting progress, triggering timeouts.
- Streaming very large amounts of data may consume all disk space before the upload begins.
Enabling S3A Fast Upload
To enable the fast upload mechanism, set the `fs.s3a.fast.upload` property to true (it is disabled by default). When this is set, the incremental block upload mechanism is used, with the buffering mechanism set in `fs.s3a.fast.upload.buffer`. The number of threads performing uploads in the filesystem is defined by `fs.s3a.threads.max`; the queue of waiting uploads is limited by `fs.s3a.max.total.tasks`. The size of each buffer is set by `fs.s3a.multipart.size`.
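For example, the following sketch (not from the original text; the bucket name is a placeholder, and in practice these properties are usually set cluster-wide in core-site.xml) enables fast upload programmatically:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class EnableFastUpload {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setBoolean("fs.s3a.fast.upload", true);    // enable incremental block upload
    conf.set("fs.s3a.fast.upload.buffer", "disk");  // disk (default), array, or bytebuffer
    conf.set("fs.s3a.multipart.size", "100M");      // size of each uploaded block

    // "example-bucket" is a placeholder; credentials must be configured separately.
    try (FileSystem fs = FileSystem.get(URI.create("s3a://example-bucket/"), conf)) {
      System.out.println("S3A filesystem ready: " + fs.getUri());
    }
  }
}
```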
Configuring S3A Fast Upload Options
The following major configuration options are available for the S3A fast upload:
Table 3.2. S3A Fast Upload Configuration Options
Parameter | Default Value | Description |
---|---|---|
`fs.s3a.fast.upload.buffer` | disk | Allowed values are: disk, array, bytebuffer. "disk" buffers data on the local disks listed in `fs.s3a.buffer.dir`; "array" buffers in byte arrays on the JVM heap; "bytebuffer" buffers in off-heap direct ByteBuffers. Both "array" and "bytebuffer" will consume memory in a single stream up to `fs.s3a.multipart.size * fs.s3a.fast.upload.active.blocks` bytes, so keep those values low when using either. The total number of threads performing work across all streams is set by `fs.s3a.threads.max`, with `fs.s3a.max.total.tasks` setting the number of queued work items. |
`fs.s3a.multipart.size` | 100M | Defines the size (in bytes) of the chunks into which the upload or copy operations will be split up. A suffix from the set {K,M,G,T,P} may be used to scale the numeric value. |
`fs.s3a.fast.upload.active.blocks` | 4 | Defines the maximum number of blocks a single output stream can have active uploading, or queued to the central FileSystem instance's pool of queued operations. This stops a single stream overloading the shared thread pool. |
`fs.s3a.buffer.dir` | Empty value | A comma-separated list of temporary directories used for storing blocks of data prior to their being uploaded to S3. When unset (the default), the Hadoop temporary directory `hadoop.tmp.dir` is used. |
Note that:
- If the amount of data written to a stream is below that set in `fs.s3a.multipart.size`, the upload is performed in the `OutputStream.close()` operation, as with the original output stream.
- The published Hadoop metrics monitor includes live queue length and upload operation counts, so you can identify when there is a backlog of work or a mismatch between data generation rates and network bandwidth. Per-stream statistics can also be logged by calling `toString()` on the current stream (see the sketch after this list).
- Incremental writes are not visible; the object can only be listed or read when the multipart operation completes in the `close()` call, which blocks until the upload is completed.
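As an illustration of the statistics note above, the following sketch (placeholder bucket and path, illustrative block size) writes data and logs the wrapped stream's `toString()` output after each write:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LogStreamStats {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setBoolean("fs.s3a.fast.upload", true);
    FileSystem fs = FileSystem.get(URI.create("s3a://example-bucket/"), conf);

    byte[] block = new byte[8 * 1024 * 1024]; // 8 MB of zeroes, purely illustrative
    try (FSDataOutputStream out = fs.create(new Path("/tmp/demo.bin"))) {
      for (int i = 0; i < 16; i++) {
        out.write(block);
        // The S3A output stream reports queue lengths and upload counters in toString().
        System.out.println(out.getWrappedStream());
      }
    } // close() blocks until the multipart upload completes; only then is the object visible
  }
}
```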
Fast Upload with Disk Buffers
This is the default buffer mechanism. The amount of data which can be buffered is limited by the amount of available disk space.
When `fs.s3a.fast.upload.buffer` is set to "disk", all data is buffered to local hard disks prior to upload. This minimizes the amount of memory consumed, and so eliminates heap size as the limiting factor in queued uploads, exactly as the original "direct to disk" buffering used when `fs.s3a.fast.upload=false`.
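A short sketch of disk buffering spread across multiple local disks; the directory paths are assumptions, not values from this document:

```java
import org.apache.hadoop.conf.Configuration;

public final class DiskBufferConfig {
  static Configuration diskBuffering() {
    Configuration conf = new Configuration();
    conf.setBoolean("fs.s3a.fast.upload", true);
    conf.set("fs.s3a.fast.upload.buffer", "disk");
    // Comma-separated local directories for blocks awaiting upload;
    // hadoop.tmp.dir is used when fs.s3a.buffer.dir is left unset.
    conf.set("fs.s3a.buffer.dir", "/data1/s3a,/data2/s3a");
    return conf;
  }
}
```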
Fast Upload with Byte Buffers
When `fs.s3a.fast.upload.buffer` is set to "bytebuffer", all data is buffered in "direct" ByteBuffers prior to upload. This may be faster than buffering to disk when there is little disk space to buffer with (for example, when using tiny EC2 VMs).
The ByteBuffers are created in the memory of the JVM, but not in the Java heap itself. The amount of data which can be buffered is limited by the Java runtime, the operating system, and, for YARN applications, the amount of memory requested for each container.
The slower the upload bandwidth to S3, the greater the risk of running out of memory, and so the more care is needed in tuning the upload thread settings to reduce the maximum amount of data which can be buffered awaiting upload (see below).
Fast Upload with Array Buffers
When `fs.s3a.fast.upload.buffer` is set to "array", all data is buffered in byte arrays in the JVM's heap prior to upload. This may be faster than buffering to disk.
The amount of data which can be buffered is limited by the available size of the JVM heap. The slower the write bandwidth to S3, the greater the risk of heap overflows. This risk can be mitigated by tuning the upload thread settings (see below).
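To make the memory bound for the array and bytebuffer mechanisms concrete, here is a back-of-the-envelope calculation with illustrative numbers (not recommendations from this document):

```java
public final class BufferBudget {
  public static void main(String[] args) {
    long multipartSize = 100L << 20; // fs.s3a.multipart.size = 100M
    int activeBlocks = 4;            // fs.s3a.fast.upload.active.blocks
    int openStreams = 8;             // concurrent output streams in this process

    // Worst case: every stream has its full quota of blocks buffered in memory.
    long perStream = multipartSize * activeBlocks;
    long processTotal = perStream * openStreams;
    System.out.printf("per stream: %d MB, process total: %d MB%n",
        perStream >> 20, processTotal >> 20);
    // Prints: per stream: 400 MB, process total: 3200 MB
  }
}
```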
Thread Tuning for S3A Fast Upload
Both the array and bytebuffer buffer mechanisms can consume very large amounts of memory, on-heap or off-heap respectively. The disk buffer mechanism does not use much memory, but it consumes hard disk capacity.
If there are many output streams being written to in a single process, the memory or disk consumed is the per-stream usage multiplied by the number of active streams.
You may need to perform careful tuning to reduce the risk of running out of memory, especially if the data is buffered in memory. There are a number of parameters which can be tuned:
- The total number of threads available in the filesystem for data uploads or any other queued filesystem operation. This is set in `fs.s3a.threads.max`.
- The number of operations which can be queued for execution, awaiting a thread. This is set in `fs.s3a.max.total.tasks`.
- The number of blocks which a single output stream can have active (that is, being uploaded by a thread or queued in the filesystem thread queue). This is set in `fs.s3a.fast.upload.active.blocks`.
- The length of time that an idle thread can stay in the thread pool before it is retired. This is set in `fs.s3a.threads.keepalivetime`.
Table 3.3. S3A Fast Upload Tuning Options
Parameter | Default Value | Description |
---|---|---|
`fs.s3a.fast.upload.active.blocks` | 4 | Maximum number of blocks a single output stream can have active (uploading, or queued to the central FileSystem instance's pool of queued operations). This stops a single stream overloading the shared thread pool. |
`fs.s3a.threads.max` | 10 | The total number of threads available in the filesystem for data uploads or any other queued filesystem operation. |
`fs.s3a.max.total.tasks` | 5 | The number of operations which can be queued for execution, awaiting a thread. |
`fs.s3a.threads.keepalivetime` | 60 | The number of seconds a thread can be idle before being terminated. |
When the maximum allowed number of active blocks of a single stream is reached, no more blocks can be uploaded from that stream until one or more of those active block uploads completes. That is, a `write()` call which would trigger an upload of a now-full data block will instead block until there is capacity in the queue.
Consider the following:
- As the pool of threads set in `fs.s3a.threads.max` is shared (and intended to be used across all threads), a larger number here can allow for more parallel operations. However, as uploads require network bandwidth, adding more threads does not guarantee speedup.
- The extra queue of tasks for the thread pool (`fs.s3a.max.total.tasks`) covers all ongoing background S3A operations.
- When using memory buffering, a small value of `fs.s3a.fast.upload.active.blocks` limits the amount of memory which can be consumed per stream.
- When using disk buffering, a larger value of `fs.s3a.fast.upload.active.blocks` does not consume much memory, but it may result in a large number of blocks competing with other filesystem operations.
We recommend a low value of `fs.s3a.fast.upload.active.blocks`: enough to start background upload without overloading other parts of the system. Then experiment to see if higher values deliver more throughput, especially from VMs running on EC2.
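As a starting point for such experiments, the following sketch (illustrative values, assuming memory buffering) keeps the per-stream block limit low while leaving the shared pool at its documented defaults:

```java
import org.apache.hadoop.conf.Configuration;

public final class UploadTuning {
  static Configuration tunedForMemoryBuffering() {
    Configuration conf = new Configuration();
    conf.setBoolean("fs.s3a.fast.upload", true);
    conf.set("fs.s3a.fast.upload.buffer", "bytebuffer");
    conf.setInt("fs.s3a.fast.upload.active.blocks", 2); // low per-stream memory cap
    conf.setInt("fs.s3a.threads.max", 10);              // shared upload thread pool
    conf.setInt("fs.s3a.max.total.tasks", 5);           // queued work items
    conf.setInt("fs.s3a.threads.keepalivetime", 60);    // idle thread lifetime (seconds)
    return conf;
  }
}
```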