YARN (MRv2) ResourceManager High Availability

The YARN ResourceManager is responsible for tracking the resources in a cluster and scheduling applications (for example, MapReduce jobs). Before CDH 5, the ResourceManager was a single point of failure in a YARN cluster. The ResourceManager high availability (HA) feature adds redundancy in the form of an active-standby ResourceManager pair to remove this single point of failure. Furthermore, upon failover from the active ResourceManager to the standby, the applications can resume from the last state saved to the state store; for example, map tasks in a MapReduce job are not run again if a failover to a new active ResourceManager occurs after the completion of the map phase. This allows events such the following to be handled without any significant performance effect on running applications:

Unplanned events such as machine crashes
Planned maintenance events such as software or hardware upgrades on the machine running the ResourceManager

ResourceManager HA requires ZooKeeper and HDFS services to be running.

Continue reading:

Architecture
Configuring YARN (MRv2) ResourceManager High Availability Using Cloudera Manager
Configuring YARN (MRv2) ResourceManager High Availability Using the Command Line
Using yarn rmadmin to Administer ResourceManager HA

Architecture

ResourceManager HA is implemented by means of an active-standby pair of ResourceManagers. On start-up, each ResourceManager is in the standby state; the process is started, but the state is not loaded. When one of the ResourceManagers is transitioning to the active state, the ResourceManager loads the internal state from the designated state store and starts all the internal services. The stimulus to transition to active comes from either the administrator (through the CLI) or through the integrated failover controller when automatic failover is enabled. The subsections that follow provide more details about the components of ResourceManager HA.

ResourceManager Restart

Restarting the ResourceManager allows for the recovery of in-flight applications if recovery is enabled. To achieve this, the ResourceManager stores its internal state, primarily application-related data and tokens, to the RMStateStore; the cluster resources are re-constructed when the NodeManagers connect. The available alternatives for the state store are MemoryRMStateStore (a memory-based implementation), and ZKRMStateStore (ZooKeeper-based implementation). Note that MemoryRMStateStore will not work for HA.

Fencing

When running two ResourceManagers, a split-brain situation can arise where both ResourceManagers assume they are active. To avoid this, only a single ResourceManager should be able to perform active operations and the other ResourceManager should be "fenced". The ZooKeeper-based state store (ZKRMStateStore) allows only a single ResourceManager to make changes to the stored state, implicitly fencing the other ResourceManager. This is accomplished by the ResourceManager claiming exclusive create-delete permissions on the root znode. The ACLs on the root znode are automatically created based on the ACLs configured for the store; in case of secure clusters, Cloudera recommends that you set ACLs for the root host such that both ResourceManagers share read-write-admin access, but have exclusive create-delete access. The fencing is implicit and does not require explicit configuration (as fencing in HDFS and MRv1 does). You can plug in a custom "Fencer" if you choose to – for example, to use a different implementation of the state store.

Configuration and FailoverProxy

In an HA setting, you should configure two ResourceManagers to use different ports (for example, ports on different hosts). To facilitate this, YARN uses the notion of an ResourceManager Identifier (rm-id). Each ResourceManager has a unique rm-id, and all the RPC configurations (<rpc-address>; for example yarn.resourcemanager.address) for that ResourceManager can be configured via <rpc-address>.<rm-id>. Clients, ApplicationMasters, and NodeManagers use these RPC addresses to talk to the active ResourceManager automatically, even after a failover. To achieve this, they cycle through the list of ResourceManagers in the configuration. This is done automatically and does not require any configuration (as it does in HDFS and MapReduce (MRv1)).

Automatic Failover

By default, ResourceManager HA uses ZKFC (ZooKeeper-based failover controller) for automatic failover in case the active ResourceManager is unreachable or goes down. Internally, the StandbyElector is used to elect the active ResourceManager. The failover controller runs as part of the ResourceManager (not as a separate process as in HDFS and MapReduce v1) and requires no further setup after the appropriate properties are configured in yarn-site.xml.

You can plug in a custom failover controller if you prefer.

Manual Transitions and Failover

You can use the command-line tool yarn rmadmin to transition a particular ResourceManager to active or standby state, to fail over from one ResourceManager to the other, to get the HA state of an ResourceManager, and to monitor an ResourceManager's health.

Configuring YARN (MRv2) ResourceManager High Availability Using Cloudera Manager

Minimum Required Role: Cluster Administrator (also provided by Full Administrator)

You can use Cloudera Manager to configure CDH 5 or higher for ResourceManager high availability (HA). Cloudera Manager supports automatic failover of the ResourceManager. It does not provide a mechanism to manually force a failover through the Cloudera Manager user interface.

Enabling High Availability

Go to the YARN service.
Select Actions > Enable High Availability. A screen showing the hosts that are eligible to run a standby ResourceManager displays. The host where the current ResourceManager is running is not available as a choice.
Select the host where you want the standby ResourceManager to be installed, and click Continue. Cloudera Manager proceeds to run a set of commands that stop the YARN service, add a standby ResourceManager, initialize the ResourceManager high availability state in ZooKeeper, restart YARN, and redeploy the relevant client configurations.
Work preserving recovery is enabled for the ResourceManager by default when you enable ResourceManager HA in Cloudera Manager. For more information, including instructions on disabling work preserving recovery, see Work Preserving Recovery for YARN Components.

Disabling High Availability

Go to the YARN service.
Select Actions > Disable High Availability. A screen showing the hosts running the ResourceManagers displays.
Select which ResourceManager (host) you want to remain as the single ResourceManager, and click Continue. Cloudera Manager runs a set of commands that stop the YARN service, remove the standby ResourceManager and the Failover Controller, restart the YARN service, and redeploy client configurations.

Configuring YARN (MRv2) ResourceManager High Availability Using the Command Line

To configure and start ResourceManager HA, proceed as follows.

Stop the YARN daemons
Configure Manual Failover, and Optionally Automatic Failover
Restart the YARN daemons

Stop the YARN daemons

Stop the MapReduce JobHistory service, ResourceManager service, and NodeManager on all hosts where they are running, as follows:

$ sudo service hadoop-mapreduce-historyserver stop
$ sudo service hadoop-yarn-resourcemanager stop
$ sudo service hadoop-yarn-nodemanager stop

Configure Manual Failover, and Optionally Automatic Failover

To configure failover:

Name	Used On	Default Value	Recommended Value	Description
`yarn.resourcemanager. ha.enabled`	ResourceManager, NodeManager, Client	`false`	`true`	Enable HA
`yarn.resourcemanager. ha.rm-id`s	ResourceManager, NodeManager, Client	(None)	Cluster-specific, for example: `rm1,rm2`	Comma-separated list of ResourceManager ids in this cluster.
`yarn.resourcemanager. ha.id`	ResourceManager	(None)	ResourceManager-specific, for example: `rm1`	Id of the current ResourceManager. Must be set explicitly on each ResourceManager to the appropriate value.
`yarn.resourcemanager. address.<rm-id>`	ResourceManager, Client	(None)	Cluster-specific	The value of `yarn.resourcemanager. address` (Client-ResourceManager RPC) for this ResourceManager. Must be set for all ResourceManagers.
`yarn.resourcemanager. scheduler.address.<rm-id>`	ResourceManager, Client	(None)	Cluster-specific	The value of `yarn.resourcemanager. scheduler.address` (AM-ResourceManager RPC) for this ResourceManager. Must beset for all ResourceManagers.
`yarn.resourcemanager. admin.address.<rm-id>`	ResourceManager, Client/Admin	(None)	Cluster-specific	The value of `yarn.resourcemanager. admin.address` (ResourceManager administration) for this ResourceManager. Must be set for all ResourceManagers.
`yarn.resourcemanager. resource-tracker.address. <rm-id>`	ResourceManager, NodeManager	(None)	Cluster-specific	The value of `yarn.resourcemanager. resource-tracker.address` (NM-ResourceManager RPC) for this ResourceManager. Must be set for all ResourceManagers.
`yarn.resourcemanager. webapp.address.<rm-id>`	ResourceManager, Client	(None)	Cluster-specific	The value of `yarn.resourcemanager. webapp.address` (ResourceManager webapp) for this ResourceManager.Must be set for all ResourceManagers.
`yarn.resourcemanager. recovery.enabled`	ResourceManager	`false`	`true`	Enable job recovery on ResourceManager restart or failover.
`yarn.resourcemanager. store.class`	ResourceManager	`org.apache.hadoop. yarn.server. resourcemanager. recovery. MemeoryRMStateStore`	`org.apache. hadoop.yarn. server. resourcemanager. recovery. ZKRMStateStore`	The RMStateStore implementation to use to store the ResourceManager's internal state. The ZooKeeper- based store supports fencing implicitly. That it, it allows a single ResourceManager to make multiple changes at a time, and hence is recommended.
`yarn.resourcemanager. zk-address`	ResourceManager	(None)	Cluster- specific	The ZooKeeper quorum to use to store the ResourceManager's internal state.
`yarn.resourcemanager. zk-acl`	ResourceManager	`world:anyone:rwcda`	Cluster- specific	The ACLs the ResourceManager uses for the znode structure to store the internal state.
`yarn.resourcemanager.zk- state-store.root-node.acl`	ResourceManager	(None)	Cluster- specific	The ACLs used for the root host of the ZooKeeper state store. The ACLs set here should allow both ResourceManagers to read, write, and administer, with exclusive access to create and delete. If nothing is specified, the root host ACLs are automatically generated on the basis of the ACLs specified through `yarn.resourcemanager.zk-acl`. But that leaves a security hole in a secure setup.

To configure automatic failover:

Configure the following additional properties in yarn-site.xml to configure automatic failover.

Configure work preserving recovery:

Optionally, you can configure work preserving recovery for the Resource Manager and Node Managers. See Work Preserving Recovery for YARN Components.

Name	Used On	Default Value	Recommended Value	Description
`yarn.resourcemanager. ha.automatic-failover.enabled`	ResourceManager	`true`	`true`	Enable automatic failover
`yarn.resourcemanager. ha.automatic-failover.embedded`	ResourceManager	`true`	true	Use the EmbeddedElectorService to pick an active ResourceManager from the ensemble
`yarn.resourcemanager. cluster-id`	ResourceManager	No default value.	Cluster- specific	Cluster name used by the ActiveStandbyElector to elect one of the ResourceManagers as leader.

The following is a sample yarn-site.xml showing these properties configured, including work preserving recovery for both ResourceManager and NM:

<configuration>
<!-- Resource Manager Configs -->
  <property>
    <name>yarn.resourcemanager.connect.retry-interval.ms</name>
    <value>2000</value>
  </property>
  <property>
    <name>yarn.resourcemanager.ha.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.resourcemanager.ha.automatic-failover.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.resourcemanager.ha.automatic-failover.embedded</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.resourcemanager.cluster-id</name>
    <value>pseudo-yarn-rm-cluster</value>
  </property>
  <property>
    <name>yarn.resourcemanager.ha.rm-ids</name>
    <value>rm1,rm2</value>
  </property>
  <property>
    <name>yarn.resourcemanager.ha.id</name>
    <value>rm1</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.recovery.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.resourcemanager.store.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
  </property>
  <property>
    <name>yarn.resourcemanager.zk-address</name>
    <value>localhost:2181</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.scheduler.connection.wait.interval-ms</name>
    <value>5000</value>
  </property>
  <property>
    <name>yarn.resourcemanager.work-preserving-recovery.enabled</name>
    <value>true</value>
  </property>

  <!-- ResourceManager1 configs -->
  <property>
    <name>yarn.resourcemanager.address.rm1</name>
    <value>host1:23140</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address.rm1</name>
    <value>host1:23130</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.https.address.rm1</name>
    <value>host1:23189</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address.rm1</name>
    <value>host1:23188</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address.rm1</name>
    <value>host1:23125</value>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address.rm1</name>
    <value>host1:23141</value>
  </property>

  <!-- ResourceManager2 configs -->
  <property>
    <name>yarn.resourcemanager.address.rm2</name>
    <value>host2:23140</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address.rm2</name>
    <value>host2:23130</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.https.address.rm2</name>
    <value>host2:23189</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address.rm2</name>
    <value>host2:23188</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address.rm2</name>
    <value>host2:23125</value>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address.rm2</name>
    <value>host2:23141</value>
  </property>

<!-- Host Manager Configs -->
  <property>
    <description>Address where the localizer IPC is.</description>
    <name>yarn.nodemanager.localizer.address</name>
    <value>0.0.0.0:23344</value>
  </property>
  <property>
    <description>NM Webapp address.</description>
    <name>yarn.nodemanager.webapp.address</name>
    <value>0.0.0.0:23999</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.local-dirs</name>
    <value>/tmp/pseudo-dist/yarn/local</value>
  </property>
  <property>
    <name>yarn.nodemanager.log-dirs</name>
    <value>/tmp/pseudo-dist/yarn/log</value>
  </property>
  <property>
    <name>mapreduce.shuffle.port</name>
    <value>23080</value>
  </property>
  <property>
    <name>yarn.resourcemanager.work-preserving-recovery.enabled</name>
    <value>true</value>
  </property>
</configuration>

Restart the YARN daemons

Start the MapReduce JobHistory server, ResourceManager, and NodeManager on all hosts where they were previously running, as follows:

$ sudo service hadoop-mapreduce-historyserver start
$ sudo service hadoop-yarn-resourcemanager start
$ sudo service hadoop-yarn-nodemanager start

Using `yarn rmadmin` to Administer ResourceManager HA

You can use yarn rmadmin on the command line to manage your ResourceManager HA deployment. yarn rmadmin has the following options related to ResourceManager HA:

[-transitionToActive serviceId]
[-transitionToStandby serviceId]
[-getServiceState serviceId]
[-checkHealth <serviceId]
[-help <command>]

where serviceId is the rm-id.

MapReduce (MRv1) and YARN (MRv2) High Availability

Work Preserving Recovery for YARN Components