Summary of best practice guidelines for partitioning.
- When Impala runs against HDFS-backed tables, scan parallelism is driven by the number of files.
- Small files seriously impact performance and put excess pressure on clusters, especially Namenode hosts and Impala Catalog Service hosts. Too few files can lead to hotspotting when too many tasks touch the same data, potentially overloading the node.
- Generally, keep files at around 256 MBs and avoid partitions that are less than 256 MBs in size.
- For small tables that are heavily queried, consider HDFS caching.
Partitioning is best approached as an iterative process, and not a scientific one. Following these simple general guidelines can yield a good experience. It is possible for you to make further improvements based on your unique usage patterns and experience.