Accessing Cloud Data

S3 Performance Checklist

Use the following checklists for data, cluster configuration, and code to ensure optimal performance when working with data in S3.

Checklist for Data

  • [ ] Amazon S3 bucket is in the same region as the EC2-hosted cluster.

  • [ ] The directory layout is "shallow". For directory listing performance, prefer shallow directory trees with many files per directory over deep directory trees with only a few files per directory.

  • [ ] The "pseudo" block size set in fs.s3a.block.size is appropriate for the work to be performed on the data.

  • [ ] Any data that needs to be read repeatedly has been copied to HDFS.
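
As an illustration of the block size item above, the following is a minimal sketch of how fs.s3a.block.size might be set, assuming the property is placed in core-site.xml. The value 128M is only an assumed example, not a recommendation; choose a size that matches how the work should be split.

```xml
<!-- Sketch only: assumed to live in core-site.xml. -->
<!-- 128M is an illustrative value; pick one that suits the job's split size. -->
<property>
  <name>fs.s3a.block.size</name>
  <value>128M</value>
</property>
```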

Checklist for Cluster Configurations

  • [ ] Set yarn.scheduler.capacity.node-locality-delay to 0 to improve container launch times.
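
The setting above can be sketched as the following property, assuming the cluster uses the YARN Capacity Scheduler and that the property is kept in capacity-scheduler.xml. Clusters managed through a management console may expose it elsewhere.

```xml
<!-- Sketch only: assumed to live in capacity-scheduler.xml. -->
<!-- 0 tells the scheduler not to wait for a node-local container slot. -->
<property>
  <name>yarn.scheduler.capacity.node-locality-delay</name>
  <value>0</value>
</property>
```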

Checklist for Code

  • [ ] Application does not make rename() calls. Where it does, it does not assume the operation is immediate.

  • [ ] Application does not assume that delete() is near-instantaneous.

  • [ ] Application uses FileSystem.listFiles(path, true), that is, with the recursive flag set, to list a directory tree.

  • [ ] Application prefers forward seeks through files, rather than full random IO.

  • [ ] If performing "random" IO through seek() and read() sequences or through Hadoop's PositionedReadable API, fs.s3a.experimental.input.fadvise is set to random.
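
For the random IO item above, a minimal sketch of the configuration follows, assuming it is applied cluster-wide in core-site.xml. It could equally be set only for the jobs that need random access.

```xml
<!-- Sketch only: assumed to live in core-site.xml. -->
<!-- "random" optimizes the S3A input stream for seek-heavy access patterns. -->
<property>
  <name>fs.s3a.experimental.input.fadvise</name>
  <value>random</value>
</property>
```

Because this policy favors seek-heavy reads, it may slow workloads that read whole files sequentially, so applying it per job is often the safer choice.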
