Improving Performance for S3
Use these checklists to ensure optimal performance when working with data in S3.
Checklist for Data

[ ] Amazon S3 bucket is in the same region as the EC2-hosted cluster.
[ ] The directory layout is "shallow": directory listing performs better on shallow directory trees with many files per directory than on deep trees with only a few files per directory.
[ ] The "pseudo" block size set in fs.s3a.block.size is appropriate for the work to be performed on the data (see the sketch after this checklist).
[ ] Copy to HDFS any data that needs to be read repeatedly.
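On the block size point: S3 has no real blocks, so the value of fs.s3a.block.size is only reported to split calculators such as the MapReduce and Spark job planners, and should match the split size you want. A minimal sketch of setting it from client code, assuming a Hadoop 2.x/3.x classpath; the bucket name and the 128 MB value are placeholders:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockSizeConf {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // S3 has no real blocks; this "pseudo" size is what split
        // calculators see, so pick a value that yields sensibly
        // sized splits for the job. 128 MB is illustrative only.
        conf.setLong("fs.s3a.block.size", 128L * 1024 * 1024);
        // "my-bucket" is a placeholder bucket name.
        FileSystem s3 = FileSystem.get(URI.create("s3a://my-bucket/"), conf);
        System.out.println("reported block size: "
            + s3.getDefaultBlockSize(new Path("s3a://my-bucket/")));
      }
    }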
Checklist for Cluster Configs

[ ] Set yarn.scheduler.capacity.node-locality-delay to 0 to improve container launch times.
[ ] When copying data using DistCp, use the recommended DistCp performance optimizations.
[ ] When reading ORC data, set fs.s3a.experimental.input.fadvise to random (see the sketch after this checklist).
[ ] If planning to use Hive with S3, review Improving Hive Performance with S3/ADLS/WASB.
[ ] If planning to use Spark with S3, review Improving Spark Performance with S3/ADLS/WASB.
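fs.s3a.experimental.input.fadvise is a client-side switch; below is a minimal sketch of enabling it for ORC reads, assuming Hadoop 2.8 or later (earlier S3A releases ignore the property). Note that yarn.scheduler.capacity.node-locality-delay is a scheduler-side property, set in capacity-scheduler.xml on the cluster rather than in client code.

    import org.apache.hadoop.conf.Configuration;

    public class OrcClientConf {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // "random" keeps the S3A input stream open for ranged GETs,
        // which suits seek-heavy columnar formats such as ORC; the
        // default policy is tuned for sequential full-file scans.
        conf.set("fs.s3a.experimental.input.fadvise", "random");
        // Hand this Configuration to the FileSystem/job reading the ORC data.
      }
    }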
Checklist for Code

[ ] Application does not make rename() calls; where it does, it does not assume the operation is immediate.
[ ] Application does not assume that delete() is near-instantaneous.
[ ] Application uses FileSystem.listFiles(path, recursive=true) to list a directory tree (see the first sketch after this checklist).
[ ] Application prefers forward seeks through files rather than full random IO.
[ ] If making "random" IO through seek() and read() sequences or through Hadoop's PositionedReadable API, fs.s3a.experimental.input.fadvise is set to random (see the second sketch after this checklist).
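The recursive listFiles() call matters because S3A can satisfy it with a flat, paginated listing of the key space, whereas a directory-by-directory treewalk issues a separate request per directory. A minimal sketch, with a placeholder bucket and path:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.LocatedFileStatus;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.RemoteIterator;

    public class ListTree {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("s3a://my-bucket/"),
            new Configuration());
        // recursive=true lets S3A enumerate the whole tree in one
        // paginated listing instead of a per-directory treewalk.
        RemoteIterator<LocatedFileStatus> it =
            fs.listFiles(new Path("s3a://my-bucket/data"), true);
        while (it.hasNext()) {
          LocatedFileStatus status = it.next();
          System.out.println(status.getPath() + " " + status.getLen());
        }
      }
    }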
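And a companion sketch of "random" IO through PositionedReadable, which FSDataInputStream implements: with fadvise set to random, S3A serves such reads with ranged GETs instead of draining and reopening a full-file stream on every backwards seek. The bucket, file path, and read offset are placeholders:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RandomRead {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.s3a.experimental.input.fadvise", "random"); // ranged reads
        FileSystem fs = FileSystem.get(URI.create("s3a://my-bucket/"), conf);
        byte[] buffer = new byte[16 * 1024];
        try (FSDataInputStream in =
                 fs.open(new Path("s3a://my-bucket/data/file.orc"))) {
          // PositionedReadable: fetch a byte range at an absolute offset
          // without moving the main stream position.
          in.readFully(1_000_000L, buffer, 0, buffer.length);
        }
      }
    }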