Addressing hotspotting

The following recommendations address "hotspotting," which occurs when many queries access the same node and overload it.

By: Manish Maheshwari, Data Architect and Data Scientist at Cloudera, Inc.

When the data for a particular query, or for the overall Impala workload, is concentrated on a limited number of nodes, those nodes can “hotspot” because they run most of the scan fragments. For example, if a small dimension table fits into a single HDFS block, multiple queries run their scan fragments on the same node, which overloads it.

To avoid this hotspotting:

For small dimension tables, increase the HDFS replication factor with the HDFS File System shell command hdfs dfs -setrep. See setrep in the command reference for more information.
For heavily queried fact table partitions, you can raise the replication factor temporarily. For example, raise it for 7 days with the setrep command, then reset the replication factor to 3 once the 7 days have elapsed.
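The two setrep recommendations above could be scripted roughly as follows. This is a minimal sketch, not a prescribed procedure: the warehouse paths, the replication factors 10 and 6, and the helper names set_replication and reset_hot_partition are all assumptions made for illustration. The script prints each command by default so that it can be reviewed before anything is changed.

```shell
# Sketch only: the paths and the replication factors 10 and 6 are illustrative assumptions.
# Dry run by default (commands are printed); invoke with RUN= (empty) to execute them.
RUN="${RUN-echo}"

# Set the replication factor of a file, or of every file under a directory.
set_replication() {
  $RUN hdfs dfs -setrep "$1" "$2"
}

# Hypothetical table locations.
DIM_TABLE=/user/hive/warehouse/sales.db/dim_region
HOT_PARTITION=/user/hive/warehouse/sales.db/fact_orders/order_date=2024-01-01

# Small dimension table: keep more replicas so scans can spread over more nodes.
set_replication 10 "$DIM_TABLE"

# Heavily queried fact partition: raise the factor for the busy period.
set_replication 6 "$HOT_PARTITION"

# Run this later (for example, from a scheduled job after 7 days) to go back to 3.
reset_hot_partition() {
  set_replication 3 "$HOT_PARTITION"
}
```

Higher replication factors consume additional disk space on the cluster, so the reset step matters for large fact partitions.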
You can change how Impala schedules scan fragments by setting the following query options:
- SET SCHEDULE_RANDOM_REPLICA=true (see SCHEDULE_RANDOM_REPLICA Query Option for details)
- SET REPLICA_PREFERENCE=REMOTE (see REPLICA_PREFERENCE Query Option for details)
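One way to apply these options is to place the SET statements at the top of a query file and pass the file to impala-shell with -f. The sketch below assumes a hypothetical query against tables named fact_orders and dim_region; the file location and the helper name scan_spread_options are also illustrative. As written, it only prints the impala-shell command.

```shell
# Sketch only: the query, table names, and file location are hypothetical.
# Dry run by default (the command is printed); invoke with RUN= (empty) to execute it.
RUN="${RUN-echo}"

# Emit the SET statements that should precede a query prone to hotspotting.
scan_spread_options() {
  echo "SET SCHEDULE_RANDOM_REPLICA=true;"
  echo "SET REPLICA_PREFERENCE=REMOTE;"
}

# Build the query file: options first, then the query itself.
QUERY_FILE="${QUERY_FILE:-${TMPDIR:-/tmp}/spread_scans.sql}"
{
  scan_spread_options
  echo "SELECT r.region_name, COUNT(*) FROM fact_orders o"
  echo "JOIN dim_region r ON o.region_id = r.region_id"
  echo "GROUP BY r.region_name;"
} > "$QUERY_FILE"

# Options set in the file apply to the statements that follow in the same session.
$RUN impala-shell -f "$QUERY_FILE"
```

Setting the options per query file, rather than cluster-wide, limits the change to the workloads that actually hotspot.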
You can use HDFS caching to cache block replicas. The Impala scheduler then randomly picks a node that hosts a cached block replica to run the scan. See Using HDFS Caching with Impala for more information.
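As a rough illustration of the caching approach, the sketch below creates an HDFS cache pool and then asks Impala to cache a table in it. The pool name hot_dims, the 4 GB limit, the table name sales.dim_region, the cached replication of 3, and both helper names are assumptions chosen for the example. Commands are printed rather than executed unless RUN is set to empty.

```shell
# Sketch only: pool name, size limit, table name, and replication are illustrative assumptions.
# Dry run by default (commands are printed); invoke with RUN= (empty) to execute them.
RUN="${RUN-echo}"

# Create an HDFS cache pool owned by the impala user with a byte limit.
create_cache_pool() {
  $RUN hdfs cacheadmin -addPool "$1" -owner impala -limit "$2"
}

# Ask Impala to cache a table in the pool with the given number of cached replicas.
cache_table() {
  $RUN impala-shell -q "ALTER TABLE $1 SET CACHED IN '$2' WITH REPLICATION = $3"
}

create_cache_pool hot_dims 4000000000
cache_table sales.dim_region hot_dims 3
```

Caching more than one replica gives the scheduler several cached copies to choose from, at the cost of more memory reserved on the DataNodes.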