Accelerating Sequential Reads Through Files in S3
> **Note:** This optimization is meant specifically for Amazon S3.
The most effective way to scan a large file is with a single HTTPS request, which is the default behavior. If the scanning code skips parts of the file using `seek()`, then you can potentially improve the performance of these forward seeks by tuning the option `spark.hadoop.fs.s3a.readahead.range`. For example:

```
spark.hadoop.fs.s3a.readahead.range 512M
```
This option declares the number of bytes a forward seek may skip by reading and discarding data, rather than closing and re-opening the HTTPS connection to S3. That close/reopen operation can be so slow that simply reading and discarding the data is often faster. This is particularly true when working with remote S3 buckets or "long-haul" connections.
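As a minimal sketch, the property can also be set programmatically when building a Spark session, rather than in `spark-defaults.conf` or via `--conf` on `spark-submit`. The application name and bucket path below are hypothetical placeholders:

```python
from pyspark.sql import SparkSession

# Set the S3A forward-seek readahead range at session creation time.
# 512M matches the value shown above; the right size depends on your
# seek patterns and network latency.
spark = (
    SparkSession.builder
    .appName("s3a-readahead-example")  # hypothetical app name
    .config("spark.hadoop.fs.s3a.readahead.range", "512M")
    .getOrCreate()
)

# Subsequent reads through the s3a:// filesystem pick up the setting.
# Hypothetical bucket and path:
df = spark.read.parquet("s3a://my-bucket/large-dataset/")
```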