Optimizing S3A read performance for different file types
The S3A filesystem client supports the notion of input policies, similar to that of the POSIX `fadvise()` API call. This tunes the behavior of the S3A client to optimize HTTP GET requests for reading different file types. You select the policy through the S3A input policy option `fs.s3a.experimental.input.fadvise`.
| Policy | Description |
|---|---|
| "normal" | This starts off as "sequential": it asks for the whole file. As soon as the application tries to seek backwards in the file it switches into "random" IO mode. This is not quite as efficient for random IO as the "random" mode, because that first read may have to be aborted. However, because it is adaptive, it is the best choice if you do not know the data formats which will be read. |
| "sequential" (default) | Read through the file, possibly with some short forward seeks. The whole document is requested in a single HTTP request; forward seeks within the readahead range are supported by skipping over the intermediate data. This leads to maximum read throughput, but with very expensive backward seeks. |
| "random" | Optimized for random IO, specifically the Hadoop `PositionedReadable` operations, though `seek(offset); read(byte_buffer)` also benefits. Rather than ask for the whole file, the range of the HTTP request is set to that of the length of data desired in the `read` operation, rounded up to the readahead value set in `setReadahead()` if necessary. By reducing the cost of closing existing HTTP requests, this is highly efficient for file IO accessing a binary file through a series of `PositionedReadable.read()` and `readFully(position, buffer)` calls. |
For operations simply reading through a file (copying, DistCp, reading gzip or other compressed formats, parsing .csv files, and so on), the sequential policy is appropriate. This is the default, so you do not need to configure it.
For the specific case of high-performance random access IO (for example, accessing ORC files), you may consider using the random policy in the following circumstances (the read pattern is sketched after this list):

- Data is read using the `PositionedReadable` API.
- There are long distance (many MB) forward seeks.
- Backward seeks are as likely as forward seeks.
- There is little or no use of single character `read()` calls or small `read(buffer)` calls.
- Applications are running close to the Amazon S3 data store; that is, the EC2 VMs on which the applications run are in the same region as the Amazon S3 bucket.
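As an illustration, here is a minimal sketch of the positioned-read pattern that benefits from the random policy. The file path and buffer sizes are hypothetical; the `readFully(position, buffer)` calls are the standard Hadoop `PositionedReadable` API:

```java
import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RandomReadSketch {
  // Read the tail of a (hypothetical) ORC file, then jump back to the start:
  // the backward-seeking, positioned reads that suit the "random" policy.
  static void readFooterThenStart(FileSystem fs, Path file) throws IOException {
    long length = fs.getFileStatus(file).getLen();
    try (FSDataInputStream in = fs.open(file)) {
      byte[] footer = new byte[16 * 1024];
      // Ranged GET covering only these bytes (plus readahead), not the whole object.
      in.readFully(length - footer.length, footer);

      byte[] header = new byte[4 * 1024];
      in.readFully(0, header); // long-distance backward seek; cheap under "random"
    }
  }
}
```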
You must set the desired fadvise policy in the configuration option `fs.s3a.experimental.input.fadvise` when the filesystem instance is created. It can only be set on a per-filesystem basis, not on a per-file-read basis. You can set it in core-site.xml:
```xml
<property>
  <name>fs.s3a.experimental.input.fadvise</name>
  <value>random</value>
</property>
```
Or, you can set it in the `spark-defaults.conf` configuration of Spark:

```
spark.hadoop.fs.s3a.experimental.input.fadvise random
```
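The option can also be set programmatically, since it is read from the Hadoop `Configuration` used to create the filesystem instance. A minimal sketch, with a placeholder bucket name:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class FadviseConfigSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Must be set before the filesystem instance is created;
    // every stream opened through this instance uses the policy.
    conf.set("fs.s3a.experimental.input.fadvise", "random");
    FileSystem fs = FileSystem.get(URI.create("s3a://example-bucket/"), conf);
    // ... open and read files through fs ...
  }
}
```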
Be aware that this random access performance comes at the expense of sequential IO, which includes reading files compressed with gzip.
Improving S3A read performance using Vectored IO
The Hadoop Vectored IO API allows file formats like ORC and Parquet to fetch a set of data ranges in a single operation instead of issuing individual read calls for each range.
Hadoop Vectored IO is an asynchronous API that enables libraries to perform other tasks while waiting for the data. Different implementations of Hadoop Vectored IO can support additional optimizations, such as merging nearby data ranges and fetching data ranges in parallel from remote cloud storage. This results in faster and more efficient data retrieval from cloud storage. The S3A connector offers a customized implementation that enables parallel and asynchronous reading of different data blocks.
You can enable Hadoop Vectored IO using `hive.exec.orc.use.hadoop-vectored.api=true` for Hive on ORC queries, and `parquet.hadoop.vectored.io.enabled=true` for Spark on Parquet queries.
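For example, following the same `spark-defaults.conf` convention shown earlier, the Parquet setting can be passed via Spark's `spark.hadoop.` prefix (assuming, as a sketch, that Spark copies it into the Hadoop configuration that Parquet reads):

```
spark.hadoop.parquet.hadoop.vectored.io.enabled true
```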
The S3A filesystem implements the `readVectored` API: the client provides a list of file ranges to read, and the call returns a future read object associated with each range. For more information about the `readVectored` API, see the FSDataInputStream specification and the Hadoop Vectored IO: your data just got faster! article.
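A minimal sketch of the call pattern, assuming a Hadoop release that ships `readVectored` and `FileRange` (the path and ranges here are hypothetical). Each range's data arrives as a future that completes when that range has been read:

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileRange;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class VectoredReadSketch {
  public static void main(String[] args) throws Exception {
    Path file = new Path("s3a://example-bucket/data/part-00000.parquet");
    FileSystem fs = file.getFileSystem(new Configuration());
    try (FSDataInputStream in = fs.open(file)) {
      List<FileRange> ranges = Arrays.asList(
          FileRange.createFileRange(0, 4096),           // e.g. a header block
          FileRange.createFileRange(1_048_576, 8192));  // e.g. a column chunk
      // One asynchronous call; S3A may merge nearby ranges and fetch in parallel.
      in.readVectored(ranges, ByteBuffer::allocate);
      for (FileRange range : ranges) {
        ByteBuffer data = range.getData().get(); // blocks until this range is ready
        // ... process data ...
      }
    }
  }
}
```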
| Property | Default value | Description |
|---|---|---|
| `fs.s3a.vectored.read.min.seek.size` | 4K | Smallest reasonable seek in bytes to group ranges together during a vectored read operation. |
| `fs.s3a.vectored.read.max.merged.size` | 1M | Largest merged read size in bytes to group ranges together during a vectored read. Setting this value to 0 disables merging of ranges. |
| `fs.s3a.vectored.active.ranged.reads` | 4 | Maximum number of range reads a single input stream can have active (downloading, or queued) to the central FileSystem instance's pool of queued operations. This stops a single stream overloading the shared thread pool. |
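Like the fadvise policy, these properties can be tuned in core-site.xml. A sketch that raises the merge window (the value here is illustrative, not a recommendation):

```xml
<property>
  <name>fs.s3a.vectored.read.max.merged.size</name>
  <value>2M</value>
</property>
```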