Configure NodeManagers for Work-preserving Restart
Configure YARN to preserve the work of running applications in the event of a NodeManager restart.
NodeManager work-preserving restart enables a NodeManager to be restarted without losing the active containers running on the node. At a high level, the NodeManager stores any necessary state in a local state store as it processes container management requests. When the NodeManager restarts, it recovers by first loading the state for its various subsystems, and then letting those subsystems perform recovery using the loaded state.
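The store-then-recover pattern described above can be illustrated with a small toy model. This is a minimal sketch only and is not the NodeManager's actual implementation: the class names ToyStateStore and ToyNodeManager, the JSON file format, and the container fields are all hypothetical and chosen purely for illustration.

```python
import json
import os
import tempfile


class ToyStateStore:
    """Hypothetical file-backed store standing in for a local state store."""

    def __init__(self, path):
        self.path = path

    def load(self):
        # A missing file means there is nothing to recover.
        if not os.path.exists(self.path):
            return {}
        with open(self.path, "r", encoding="utf-8") as f:
            return json.load(f)

    def save(self, state):
        # Write to a temporary file and rename it into place, so that a
        # crash in the middle of a write does not leave a corrupt store.
        directory = os.path.dirname(self.path) or "."
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(state, f)
        os.replace(tmp_path, self.path)


class ToyNodeManager:
    """Hypothetical manager that persists state on every container request."""

    def __init__(self, store):
        self.store = store
        # Recovery step: load the stored state first, then hand it to the
        # subsystem (here, just the container table).
        self.containers = dict(store.load().get("containers", {}))

    def start_container(self, container_id, command):
        self.containers[container_id] = {"command": command}
        self._persist()

    def stop_container(self, container_id):
        self.containers.pop(container_id, None)
        self._persist()

    def _persist(self):
        self.store.save({"containers": self.containers})
```

In this toy model, constructing a second ToyNodeManager on the same store path plays the role of a restart: the new instance sees the containers that were still active when the first instance stopped.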
To configure work-preserving restart, set the following properties in the yarn-site.xml file on all NodeManagers in the cluster.
Property:
yarn.nodemanager.recovery.enabled
Value:
true
Description:
Enables the NodeManager to recover after a restart.
Example:
<property> <name>yarn.nodemanager.recovery.enabled</name> <value>true</value> </property>
Property:
yarn.nodemanager.recovery.dir
Value:
<yarn_log_dir_prefix>/nodemanager/recovery-state
Description:
The local file system directory in which the NodeManager will store state information when recovery is enabled.
Example:
<property> <name>yarn.nodemanager.recovery.dir</name> <value><yarn_log_dir_prefix>/nodemanager/recovery-state</value> </property>
You should also confirm that the yarn.nodemanager.address port is set to a non-zero value (for example, 45454), so that the NodeManager listens on the same port before and after a restart:
<property> <name>yarn.nodemanager.address</name> <value>0.0.0.0:45454</value> </property>
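As a quick sanity check, the three settings above can be verified with a short script. The following is a minimal sketch, assuming the configuration uses the standard Hadoop layout of property elements inside a configuration root element. The function names read_properties and check_work_preserving_restart are hypothetical helpers, not part of Hadoop, and the script does not resolve ${...} variable substitutions.

```python
import xml.etree.ElementTree as ET


def read_properties(xml_text):
    """Return a {name: value} dict from Hadoop-style configuration XML."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def check_work_preserving_restart(props):
    """Return a list of problems; an empty list means the settings look right."""
    problems = []
    if props.get("yarn.nodemanager.recovery.enabled", "").lower() != "true":
        problems.append("yarn.nodemanager.recovery.enabled is not true")
    if not props.get("yarn.nodemanager.recovery.dir"):
        problems.append("yarn.nodemanager.recovery.dir is not set")
    # The port is whatever follows the last colon in host:port.
    address = props.get("yarn.nodemanager.address", "")
    port = address.rpartition(":")[2]
    if not port.isdigit() or int(port) == 0:
        problems.append("yarn.nodemanager.address does not use a non-zero port")
    return problems
```

To run the check against a real file, read its contents and pass them to read_properties, for example read_properties(open(path).read()), where path points to the yarn-site.xml on the NodeManager host. The location of that file depends on the installation and is not assumed here.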