Cloud Data Access

Accelerating Sequential Reads Through Files in S3

[Note]Note

This optimization is meant specifically for Amazon S3.

The most effective way to scan a large file is in a single HTTPS request, which is the default behavior. If the scanning code skips parts of the file using seek(), you may be able to improve the performance of those forward seeks by tuning the option spark.hadoop.fs.s3a.readahead.range. For example:

spark.hadoop.fs.s3a.readahead.range 512M

This option sets how many bytes the S3A connector will read and discard when seeking forward in a file before it closes and re-opens the HTTPS connection to S3. The close/reopen operation can be so slow that simply reading and discarding the data is often faster. This is particularly true when working with remote S3 buckets over "long-haul" connections.
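The same option can also be set programmatically when building a SparkSession. The following is a minimal sketch assuming a PySpark environment with the S3A connector and AWS credentials already configured; the bucket and file names are placeholders, not values from this guide.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3a-readahead-example")
    # Read and discard up to 512 MB on a forward seek instead of
    # closing and re-opening the HTTPS connection to S3.
    .config("spark.hadoop.fs.s3a.readahead.range", "512M")
    .getOrCreate()
)

# Any S3A read path benefits; "s3a://my-bucket/data.csv" is a placeholder.
df = spark.read.csv("s3a://my-bucket/data.csv", header=True)
df.show()

Because the setting is applied per SparkSession, it takes effect for every S3A read in that session without requiring changes to cluster-wide configuration files.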