YARN Resource Management

Configuring NodeManagers for Work-Preserving Restart

NodeManager work-preserving restart enables a NodeManager to be restarted without losing the active containers running on the node. At a high level, the NodeManager stores any necessary state in a local state store as it processes container management requests. When the NodeManager restarts, it recovers by first loading the state for its various subsystems, and then letting those subsystems perform recovery using the loaded state.
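The persist-then-recover pattern described above can be illustrated with a small toy model. This is only a sketch of the general idea, not the NodeManager's actual implementation: the class `ToyNodeManager`, the JSON file format, and the container fields are all hypothetical names chosen for illustration.

```python
import json
import os
import tempfile


class ToyNodeManager:
    """Illustrative only: persists container state locally as requests arrive."""

    def __init__(self, state_path):
        self.state_path = state_path
        self.containers = {}
        self._recover()

    def _recover(self):
        # On startup, load whatever state a previous instance persisted.
        if os.path.exists(self.state_path):
            with open(self.state_path) as f:
                self.containers = json.load(f)

    def _persist(self):
        # Write to a temporary file and rename it, so an interrupted
        # write does not leave a half-written state file behind.
        tmp_path = self.state_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(self.containers, f)
        os.replace(tmp_path, self.state_path)

    def start_container(self, container_id, command):
        self.containers[container_id] = {"command": command, "status": "RUNNING"}
        self._persist()

    def complete_container(self, container_id):
        del self.containers[container_id]
        self._persist()


def demo():
    state_path = os.path.join(tempfile.mkdtemp(), "nm-state.json")
    nm = ToyNodeManager(state_path)
    nm.start_container("container_01", "sleep 600")
    nm.start_container("container_02", "sleep 600")
    nm.complete_container("container_02")
    # Simulate a restart: a fresh instance reloads state from the same path.
    restarted = ToyNodeManager(state_path)
    return restarted.containers


if __name__ == "__main__":
    print(demo())
```

The key point of the sketch is that state is written on every request rather than at shutdown, so the information needed for recovery is already on disk whenever the process stops.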

To configure work-preserving restart for NodeManagers, set the following properties in the yarn-site.xml file on all NodeManagers in the cluster.

Property:

yarn.nodemanager.recovery.enabled

Value:

true

Description:

Enables the NodeManager to recover after a restart.

Example:

<property>
    <name>yarn.nodemanager.recovery.enabled</name>
    <value>true</value>
 </property>

Property:

yarn.nodemanager.recovery.dir

Value:

<yarn_log_dir_prefix>/nodemanager/recovery-state

Description:

The local file system directory in which the NodeManager will store state information when recovery is enabled.

Example:

<property>
    <name>yarn.nodemanager.recovery.dir</name>
    <value><yarn_log_dir_prefix>/nodemanager/recovery-state</value>
 </property>

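Also confirm that the yarn.nodemanager.address port is set to a fixed, non-zero value (for example, 45454). A port of 0 selects an ephemeral port, so the NodeManager could come back on a different port after a restart, and clients that were communicating with it before the restart would lose their connection. For example: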

<property>
    <name>yarn.nodemanager.address</name>
    <value>0.0.0.0:45454</value>
 </property>
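
As a quick sanity check, the three settings above can be verified with a short script. The following is a minimal sketch that assumes the properties sit inside a standard `<configuration>` root element; the function names `load_properties` and `check_nm_recovery`, and the sample recovery directory, are hypothetical and used only for illustration. Note that the `<yarn_log_dir_prefix>` placeholder must already be replaced with a real path, otherwise the file is not parseable XML.

```python
import xml.etree.ElementTree as ET


def load_properties(xml_text):
    """Return a dict of name -> value from Hadoop-style configuration XML."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


def check_nm_recovery(props):
    """Return a list of problems; an empty list means all checks passed."""
    problems = []
    if props.get("yarn.nodemanager.recovery.enabled", "").lower() != "true":
        problems.append("yarn.nodemanager.recovery.enabled is not true")
    if not props.get("yarn.nodemanager.recovery.dir"):
        problems.append("yarn.nodemanager.recovery.dir is not set")
    # The port is whatever follows the last colon in host:port.
    port = props.get("yarn.nodemanager.address", "").rpartition(":")[2]
    if not port.isdigit() or int(port) == 0:
        problems.append("yarn.nodemanager.address does not use a fixed non-zero port")
    return problems


# Sample input; the recovery directory below is a made-up example path.
SAMPLE = """<configuration>
  <property>
    <name>yarn.nodemanager.recovery.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.nodemanager.recovery.dir</name>
    <value>/var/log/hadoop-yarn/nodemanager/recovery-state</value>
  </property>
  <property>
    <name>yarn.nodemanager.address</name>
    <value>0.0.0.0:45454</value>
  </property>
</configuration>"""

if __name__ == "__main__":
    for problem in check_nm_recovery(load_properties(SAMPLE)):
        print("WARNING:", problem)
```

To check a real node, read the contents of its yarn-site.xml file and pass the text to `load_properties` in place of `SAMPLE`. The script inspects only the file it is given, so any values overridden elsewhere are not taken into account.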