Addressing hotspotting

Following recommendations can address "hotspotting," which is what happens when many queries access the same node, causing it to become overloaded.

By: Manish Maheshwari, Data Architect and Data Scientist at Cloudera, Inc.

When data for a particular query or the overall Impala workload is concentrated on a limited number of nodes, those nodes can “hotspot” due to scan fragments. For example, if you have a small dimension table that fits into a single HDFS block, multiple queries run the scan fragment on the same node causing it to get overloaded.

To avoid this hotspotting:

  • For small dimension tables, increase the HDFS replication factor by using the HDFS File System shell command hdfs dfs -setrep options. See setrep in the command reference for more information.
  • For fact table partitions that are heavily queried, the replication factor can be set temporarily. For example, you can set it for 7 days using the above setrep command. Then after 7 days have elapsed, you can reset the replication factor to 3.
  • Impala planned behaviour can be changed by setting the following query options:

  • HDFS caching can be used to cache block replicas. This causes the Impala scheduler to randomly pick a node that is hosting a cached block replica for the scan. See Using HDFS Caching with Impala for more information.