2. Configuring NodeManagers for Work-Preserving Restart

NodeManager work-preserving restart enables a NodeManager to be restarted without losing the active containers running on the node. At a high level, the NodeManager stores any necessary state in a local state store as it processes container management requests. When the NodeManager restarts, it recovers by first loading the state for its various subsystems, and then letting those subsystems perform recovery using the loaded state.

To configure work-preserving restart for NodeManagers, set the following properties in the yarn-site.xml file on all NodeManagers in the cluster.

Property: yarn.nodemanager.recovery.enabled Value: true

Description: Enables the NodeManager to recover after a restart.

Example:

<property>
    <name>yarn.nodemanager.recovery.enabled</name>
    <value>true</value>
</property>

Property: yarn.nodemanager.recovery.dir Value: <yarn_log_dir_prefix>/nodemanager/recovery-state

Description: The local file system directory in which the NodeManager will store state information when recovery is enabled.

Example:

<property>
    <name>yarn.nodemanager.recovery.dir</name>
    <value><yarn_log_dir_prefix>/nodemanager/recovery-state</value>
</property>

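You should also confirm that the port in yarn.nodemanager.address is set to a fixed, non-zero value, such as 45454. With port 0, the NodeManager binds to an ephemeral port that can differ before and after a restart, which breaks any clients that were communicating with the NodeManager before it restarted: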

<property>
    <name>yarn.nodemanager.address</name>
    <value>0.0.0.0:45454</value>
</property>
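
The three settings above can be sanity-checked before restarting a NodeManager. The following is a minimal sketch in Python, not part of YARN itself: the helper names load_properties and check_recovery_config, and the sample path /var/log/hadoop-yarn, are hypothetical and chosen only for illustration. It assumes the <yarn_log_dir_prefix> placeholder has already been replaced with a real directory, since the placeholder as written is not well-formed XML.

```python
import xml.etree.ElementTree as ET


def load_properties(xml_text):
    """Return a dict of name -> value from a Hadoop-style configuration document."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def check_recovery_config(props):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if props.get("yarn.nodemanager.recovery.enabled", "").lower() != "true":
        problems.append("yarn.nodemanager.recovery.enabled is not set to true")
    if not props.get("yarn.nodemanager.recovery.dir"):
        problems.append("yarn.nodemanager.recovery.dir is not set")
    # Take everything after the last colon as the port; a missing or zero
    # port is flagged because it cannot be relied on across restarts.
    port = props.get("yarn.nodemanager.address", "").rpartition(":")[2]
    if not port.isdigit() or int(port) == 0:
        problems.append("yarn.nodemanager.address must use a fixed, non-zero port")
    return problems


# Hypothetical sample content; the recovery directory is an assumed path.
SAMPLE = """<configuration>
  <property>
    <name>yarn.nodemanager.recovery.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.nodemanager.recovery.dir</name>
    <value>/var/log/hadoop-yarn/nodemanager/recovery-state</value>
  </property>
  <property>
    <name>yarn.nodemanager.address</name>
    <value>0.0.0.0:45454</value>
  </property>
</configuration>"""

if __name__ == "__main__":
    for problem in check_recovery_config(load_properties(SAMPLE)):
        print("WARNING:", problem)
```

To check a real node, read the contents of that node's yarn-site.xml into a string and pass it to load_properties in place of SAMPLE. The sketch only inspects values set explicitly in the file; it does not resolve defaults or variable substitutions that Hadoop applies at runtime.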
