S3 Performance Checklist
There are various types of checklists that you can use to ensure optimal performance when working with data in S3.
Checklist for Data
[ ]Amazon S3 bucket is in same region as the EC2-hosted cluster.
[ ]The directory layout is "shallow". For directory listing performance, the directory layout prefers "shallow" directory trees with many files over deep directory trees with only a few files per directory.
[ ]The "pseudo" block size set in
fs.s3a.block.sizeis appropriate for the work to be performed on the data.
[ ]Copy to HDFS any data that needs to be repeatedly read to HDFS.
Checklist for Cluster Configurations
yarn.scheduler.capacity.node-locality-delayto 0 to improve container launch times.
Checklist for Code
[ ]Application does not make
rename()calls. Where it does, it does not assume the operation is immediate.
[ ]Application does not assume that
[ ]Application uses
FileSystem.listFiles(path, recursive=true)to list a directory tree.
[ ]Application prefers forward seeks through files, rather than full random IO.
[ ]If making "random" IO through
read()sequences or and Hadoop's
fs.s3a.experimental.input.fadviseis set to