What's New in YARN and YARN Queue Manager

New features and functional updates for YARN and YARN Queue Manager are introduced in Cloudera Runtime 7.3.2, its service packs, and cumulative hotfixes.

Cloudera Runtime 7.3.2

Hadoop rebase summary
In Cloudera Runtime 7.3.2, Apache Hadoop is rebased to version 3.4.1. The Apache Hadoop upgrade improves overall performance and includes all the new features, improvements, and bug fixes from versions 3.2, 3.3, and 3.4.
Table 1. Improvements added between Apache Hadoop versions 3.2 and 3.4

Apache Hadoop version: 3.4
Apache Jira: YARN-9279
Name: YARN Hamlet package removal
Description: The deprecated org.apache.hadoop.yarn.webapp.hamlet package is completely removed to improve maintainability. This is an incompatible change in Hadoop YARN 3.4.0 and later. Applications that rely on the old package must be updated to use the org.apache.hadoop.yarn.webapp.hamlet2 package. This change affects the YARN webapp component.

Apache Hadoop version: 3.4
Apache Jira: YARN-10820
Name: Enhanced reliability for the yarn node -list command
Description: A thread-safety issue in GetClusterNodesRequestPBImpl, which previously caused intermittent failures such as java.lang.ArrayIndexOutOfBoundsException when running the yarn node -list command, is fixed. This change affects the YARN client in Hadoop YARN 3.4.0, 3.3.2, and 3.2.4, eliminating random crashes of the yarn node -list command.
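The pattern behind the YARN-10820 fix is replacing a shared list atomically under a lock instead of clearing and re-filling it while other threads read it. The following is a minimal illustrative Python sketch of that pattern; the class and field names are assumptions for illustration, not Hadoop's actual GetClusterNodesRequestPBImpl code.

```python
import threading

class NodeListRequest:
    """Sketch of a request object whose node-state list is read and rebuilt
    concurrently -- the access pattern behind the YARN-10820 fix.
    (Illustrative only; not Hadoop's actual implementation.)"""

    def __init__(self, node_states):
        self._lock = threading.Lock()    # guards _states, like the added synchronization
        self._states = list(node_states)

    def get_states(self):
        # Copy under the lock so callers never observe a half-rebuilt list.
        with self._lock:
            return list(self._states)

    def set_states(self, node_states):
        # Swap in a fully built list instead of mutating in place; the
        # in-place clear-and-refill window is where readers used to crash.
        with self._lock:
            self._states = list(node_states)

req = NodeListRequest(["RUNNING", "UNHEALTHY"])
req.set_states(["RUNNING"])
print(req.get_states())  # -> ['RUNNING']
```

A reader calling get_states() now sees either the old list or the new one, never an intermediate state.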
Table 2. Issues fixed between Apache Hadoop versions 3.2 and 3.4

Apache Hadoop version: 3.3
Apache Jira: MAPREDUCE-6190
Name: MapReduce task initialization timeout
Description: Previously, MapReduce jobs stopped responding if a task terminated before sending its first heartbeat: the task never timed out and remained stuck indefinitely in the STARTING state. This issue is resolved by introducing a dedicated timeout mechanism that catches and terminates tasks that fail to initialize and send their first heartbeat.
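The MAPREDUCE-6190 behavior can be pictured as a periodic check that flags launched tasks with no first heartbeat inside a timeout window. The sketch below is a hedged Python illustration of that idea; the field names and the timeout value are assumptions, not Hadoop's API.

```python
def check_stuck_tasks(tasks, now, launch_timeout=30.0):
    """Return ids of tasks that were launched but never sent a first
    heartbeat within launch_timeout seconds -- the dedicated timeout that
    MAPREDUCE-6190 introduces for tasks stuck in STARTING.
    (Illustrative sketch; not Hadoop's actual implementation.)"""
    stuck = []
    for task_id, task in tasks.items():
        never_reported = task["last_heartbeat"] is None
        if never_reported and now - task["launch_time"] > launch_timeout:
            stuck.append(task_id)  # candidate for termination and reattempt
    return stuck

tasks = {
    "attempt_1": {"launch_time": 0.0,  "last_heartbeat": None},  # never reported in
    "attempt_2": {"launch_time": 0.0,  "last_heartbeat": 5.0},   # healthy
    "attempt_3": {"launch_time": 90.0, "last_heartbeat": None},  # still within the window
}
print(check_stuck_tasks(tasks, now=100.0))  # -> ['attempt_1']
```

Without such a check, a task that dies before its first heartbeat is never marked as timed out, which is exactly how jobs used to hang.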
Apache Hadoop version: 3.4
Apache Jira: YARN-9809
Name: Miscommunication between the RM and NMs when NodeManagers are unhealthy
Description: Previously, a NodeManager (NM) that registered in an unhealthy state did not communicate its status immediately. As a result, the ResourceManager (RM) mistakenly scheduled many containers on the unhealthy node before the first heartbeat was received. Once the first heartbeat finally arrived, the RM recognized the unhealthy status and abruptly ended all the recently scheduled containers, causing unnecessary task failures and wasted resources. This issue is resolved: NMs now explicitly supply their health status during their initial registration with the RM.
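The effect of the YARN-9809 change is that health status arrives with registration itself, so the scheduler can exclude an unhealthy node from the start rather than after the first heartbeat. A minimal Python sketch of that behavior follows; the function names and data layout are assumptions for illustration, not the RM's actual API.

```python
def register_node(cluster, node_id, healthy, health_report=""):
    """Register a NodeManager, carrying its health status in the
    registration request itself -- the YARN-9809 behavior.
    (Illustrative sketch; not Hadoop's actual RM/NM protocol.)"""
    cluster[node_id] = {"healthy": healthy, "health_report": health_report}

def schedulable_nodes(cluster):
    # Only nodes that registered as healthy are considered for containers,
    # so no work is placed on a node later revealed to be unhealthy.
    return [node for node, info in cluster.items() if info["healthy"]]

cluster = {}
register_node(cluster, "nm-1", healthy=True)
register_node(cluster, "nm-2", healthy=False, health_report="local-dirs are bad")
print(schedulable_nodes(cluster))  # -> ['nm-1']
```

Before the fix, the equivalent of this filter could not run until the first heartbeat delivered the health report, leaving a window in which containers were scheduled and then abruptly killed.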