What's New in YARN and YARN Queue Manager

New features and functional updates for YARN and YARN Queue Manager are introduced in Cloudera Runtime 7.3.2, its service packs, and cumulative hotfixes.

Cloudera Runtime 7.3.2

Hadoop rebase summary
In Cloudera Runtime 7.3.2, Apache Hadoop is rebased to version 3.4.1. The Apache Hadoop upgrade improves overall performance and includes all the new features, improvements, and bug fixes from versions 3.2, 3.3, and 3.4.
Table 1. Improvements added between Apache Hadoop versions 3.2 and 3.4

Apache Hadoop version: 3.4
Apache Jira: YARN-9279
Name: YARN Hamlet package removal
Description: The deprecated org.apache.hadoop.yarn.webapp.hamlet package is completely removed to improve maintainability. This is an incompatible change in Hadoop YARN 3.4.0 and later. Applications that rely on the old package must be updated to use the org.apache.hadoop.yarn.webapp.hamlet2 package. This change affects the YARN webapp component.

Apache Hadoop version: 3.4
Apache Jira: YARN-10820
Name: Enhanced reliability for the yarn node -list command
Description: A thread-safety issue in GetClusterNodesRequestPBImpl, which previously caused intermittent failures such as java.lang.ArrayIndexOutOfBoundsException when running the yarn node -list command, is fixed. This change affects the YARN client in Hadoop YARN 3.4.0, 3.3.2, and 3.2.4, eliminating random crashes of the yarn node -list command.
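The pattern behind the YARN-10820 fix is replacing a shared list atomically under a lock instead of clearing and re-filling it while other threads read it. The following is a minimal illustrative Python sketch of that pattern; the class and field names are assumptions for illustration, not Hadoop's actual GetClusterNodesRequestPBImpl code.

```python
import threading

class NodeListRequest:
    """Sketch of a request object whose node-state list is read and rebuilt
    concurrently -- the access pattern behind the YARN-10820 fix.
    (Illustrative only; not Hadoop's actual implementation.)"""

    def __init__(self, node_states):
        self._lock = threading.Lock()    # guards _states, like the added synchronization
        self._states = list(node_states)

    def get_states(self):
        # Copy under the lock so callers never observe a half-rebuilt list.
        with self._lock:
            return list(self._states)

    def set_states(self, node_states):
        # Swap in a fully built list instead of mutating in place; the
        # in-place clear-and-refill window is where readers used to crash.
        with self._lock:
            self._states = list(node_states)

req = NodeListRequest(["RUNNING", "UNHEALTHY"])
req.set_states(["RUNNING"])
print(req.get_states())  # -> ['RUNNING']
```

A reader calling get_states() now sees either the old list or the new one, never an intermediate state.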
Table 2. Issues fixed between Apache Hadoop versions 3.2 and 3.4

Apache Hadoop version: 3.3
Apache Jira: MAPREDUCE-6190
Name: MapReduce task initialization timeout
Description: Previously, MapReduce jobs stopped responding if a task terminated before sending its first heartbeat: the task never timed out and remained stuck indefinitely in the STARTING state. This issue is resolved by introducing a dedicated timeout mechanism that catches and terminates tasks that fail to initialize and send their first heartbeat.
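The MAPREDUCE-6190 behavior can be pictured as a periodic check that flags launched tasks with no first heartbeat inside a timeout window. The sketch below is a hedged Python illustration of that idea; the field names and the timeout value are assumptions, not Hadoop's API.

```python
def check_stuck_tasks(tasks, now, launch_timeout=30.0):
    """Return ids of tasks that were launched but never sent a first
    heartbeat within launch_timeout seconds -- the dedicated timeout that
    MAPREDUCE-6190 introduces for tasks stuck in STARTING.
    (Illustrative sketch; not Hadoop's actual implementation.)"""
    stuck = []
    for task_id, task in tasks.items():
        never_reported = task["last_heartbeat"] is None
        if never_reported and now - task["launch_time"] > launch_timeout:
            stuck.append(task_id)  # candidate for termination and reattempt
    return stuck

tasks = {
    "attempt_1": {"launch_time": 0.0,  "last_heartbeat": None},  # never reported in
    "attempt_2": {"launch_time": 0.0,  "last_heartbeat": 5.0},   # healthy
    "attempt_3": {"launch_time": 90.0, "last_heartbeat": None},  # still within the window
}
print(check_stuck_tasks(tasks, now=100.0))  # -> ['attempt_1']
```

Without such a check, a task that dies before its first heartbeat is never marked as timed out, which is exactly how jobs used to hang.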
Apache Hadoop version: 3.4
Apache Jira: YARN-9809
Name: Miscommunication between the RM and NMs when NodeManagers are unhealthy
Description: Previously, a NodeManager (NM) that registered in an unhealthy state did not communicate its status immediately. As a result, the ResourceManager (RM) mistakenly scheduled many containers on the unhealthy node before the first heartbeat was received. Once the first heartbeat finally arrived, the RM recognized the unhealthy status and abruptly ended all the recently scheduled containers, causing unnecessary task failures and wasted resources. This issue is resolved: NMs now explicitly supply their health status during their initial registration with the RM.
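The effect of the YARN-9809 change is that health status arrives with registration itself, so the scheduler can exclude an unhealthy node from the start rather than after the first heartbeat. A minimal Python sketch of that behavior follows; the function names and data layout are assumptions for illustration, not the RM's actual API.

```python
def register_node(cluster, node_id, healthy, health_report=""):
    """Register a NodeManager, carrying its health status in the
    registration request itself -- the YARN-9809 behavior.
    (Illustrative sketch; not Hadoop's actual RM/NM protocol.)"""
    cluster[node_id] = {"healthy": healthy, "health_report": health_report}

def schedulable_nodes(cluster):
    # Only nodes that registered as healthy are considered for containers,
    # so no work is placed on a node later revealed to be unhealthy.
    return [node for node, info in cluster.items() if info["healthy"]]

cluster = {}
register_node(cluster, "nm-1", healthy=True)
register_node(cluster, "nm-2", healthy=False, health_report="local-dirs are bad")
print(schedulable_nodes(cluster))  # -> ['nm-1']
```

Before the fix, the equivalent of this filter could not run until the first heartbeat delivered the health report, leaving a window in which containers were scheduled and then abruptly killed.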