Addressing hotspotting
The following recommendations can address "hotspotting," which occurs when many queries access the same node and overload it.
By: Manish Maheshwari, Data Architect and Data Scientist at Cloudera, Inc.
When data for a particular query or the overall Impala workload is concentrated on a limited number of nodes, those nodes can “hotspot” due to scan fragments. For example, if you have a small dimension table that fits into a single HDFS block, multiple queries run the scan fragment for that block on the same node, causing it to become overloaded.
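One way to check whether a table is at risk is to look at how many HDFS blocks back it and which DataNodes hold the replicas. The sketch below wraps the standard hdfs fsck report in a helper. The path /warehouse/dim_small and the helper names run and show_block_locations are hypothetical, and by default the script only prints the command (DRY_RUN=1) so that it can be reviewed before it is run against a cluster.

```shell
#!/bin/sh
# Print a command instead of running it unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Report files, blocks, and replica locations under an HDFS path.
# A table stored as a single block with few replicas is a hotspot candidate.
show_block_locations() {
    run hdfs fsck "$1" -files -blocks -locations
}

# Hypothetical table directory; substitute the real location of the table.
show_block_locations /warehouse/dim_small
```

Setting DRY_RUN=0 in the environment makes the helper execute the command instead of printing it.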
To avoid this hotspotting:
- For small dimension tables, increase the HDFS replication factor by using the HDFS File System shell command hdfs dfs -setrep. See setrep in the command reference for more information.
- For fact table partitions that are heavily queried, the replication factor can be raised temporarily. For example, you can raise it for 7 days using the above setrep command. Then, after 7 days have elapsed, you can reset the replication factor to 3.
- The way Impala schedules scan fragments can be changed by setting the following query options:
  - SET SCHEDULE_RANDOM_REPLICA=true
    See SCHEDULE_RANDOM_REPLICA Query Option for details.
  - SET REPLICA_PREFERENCE=REMOTE
    See REPLICA_PREFERENCE Query Option for details.
- HDFS caching can be used to cache block replicas. This causes the Impala scheduler to randomly pick a node that is hosting a cached block replica for the scan. See Using HDFS Caching with Impala for more information.
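
The replication and query-option recommendations above can be sketched as a small shell script. This is a minimal illustration under stated assumptions, not a tested procedure: the paths, the table name dim_small, the replication factor of 10, and the helper names run, set_replication, and query_with_spread_scans are all hypothetical. The script passes the query options through impala-shell's --query_option flag rather than SET statements, and it only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Print a command instead of running it unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Set the replication factor ($1) of the files under an HDFS path ($2).
# -w waits until the target replication is reached, which can take a while.
set_replication() {
    run hdfs dfs -setrep -w "$1" "$2"
}

# Run one query ($1) with both replica-related query options set.
query_with_spread_scans() {
    run impala-shell \
        --query_option="SCHEDULE_RANDOM_REPLICA=true" \
        --query_option="REPLICA_PREFERENCE=REMOTE" \
        -q "$1"
}

# Small dimension table: raise replication so more nodes hold a replica.
set_replication 10 /warehouse/dim_small

# Heavily queried fact partition: raise replication for the busy period.
set_replication 10 /warehouse/fact_sales/day=2024-01-01
# After the busy period (7 days in the example above), restore the factor:
#   set_replication 3 /warehouse/fact_sales/day=2024-01-01

query_with_spread_scans "SELECT COUNT(*) FROM dim_small"
```

A higher replication factor uses proportionally more disk space, so the value chosen for a small table or a hot partition is a trade-off between storage and scan distribution.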