Following recommendations can address "hotspotting," which is what happens when many queries access the same node, causing it to become overloaded.
When data for a particular query or the overall Impala workload is concentrated on a limited number of nodes, those nodes can “hotspot” due to scan fragments. For example, if you have a small dimension table that fits into a single HDFS block, multiple queries run the scan fragment on the same node causing it to get overloaded.
To avoid this hotspotting:
- For small dimension tables, increase the HDFS replication factor by
using the HDFS File System shell command
hdfs dfs -setrepoptions. See
setrepin the command reference for more information.
- For fact table partitions that are heavily queried, the replication
factor can be set temporarily. For example, you can set it for 7 days
using the above
setrepcommand. Then after 7 days have elapsed, you can reset the replication factor to 3.
Impala planned behaviour can be changed by setting the following query options:
- HDFS caching can be used to cache block replicas. This causes the Impala scheduler to randomly pick a node that is hosting a cached block replica for the scan. See Using HDFS Caching with Impala for more information.