Enabling HDFS HA

An HDFS high availability (HA) cluster uses two NameNodes—an active NameNode and a standby NameNode. Only one NameNode can be active at any point in time. HDFS HA depends on maintaining a log of all namespace modifications in a location available to both NameNodes, so that in the event of a failure, the standby NameNode has up-to-date information about the edits and location of blocks in the cluster.

Enabling HDFS HA Using Cloudera Manager

Minimum Required Role: Cluster Administrator (also provided by Full Administrator)

You can use Cloudera Manager to configure your CDH cluster for HDFS HA and automatic failover. In Cloudera Manager, HA is implemented using Quorum-based storage. Quorum-based storage relies upon a set of JournalNodes, each of which maintains a local edits directory that logs the modifications to the namespace metadata. Enabling HA enables automatic failover as part of the same command.

Enabling High Availability and Automatic Failover

The Enable High Availability workflow leads you through adding a second (standby) NameNode and configuring JournalNodes.

  1. Perform all the configuration and setup tasks described under Configuring Hardware for HDFS HA.
  2. Ensure that you have a ZooKeeper service.
  3. Go to the HDFS service.
  4. Select Actions > Enable High Availability. A screen showing the hosts that are eligible to run a standby NameNode and the JournalNodes displays.
    1. Specify a name for the nameservice and click Continue.
    2. In the NameNode Hosts field, click Select a host. The host selection dialog box displays.
    3. Check the checkbox next to the hosts where you want the standby NameNode to be set up and click OK. The standby NameNode cannot be on the same host as the active NameNode, and the host that is chosen should have the same hardware configuration (RAM, disk space, number of cores, and so on) as the active NameNode.
    4. In the JournalNode Hosts field, click Select hosts. The host selection dialog box displays.
    5. Check the checkboxes next to an odd number of hosts (a minimum of three) to act as JournalNodes and click OK. JournalNodes should be hosted on hosts with similar hardware specification as the NameNodes. Cloudera recommends that you put a JournalNode each on the same hosts as the active and standby NameNodes, and the third JournalNode on similar hardware, such as the JobTracker.
    6. Click Continue.
    7. In the JournalNode Edits Directory property, enter a directory location for the JournalNode edits directory into the fields for each JournalNode host.
      • You may enter only one directory for each JournalNode. The paths do not need to be the same on every JournalNode.
      • The directories you specify should be empty.
      • The directory owner should be hdfs:hadoop and must have read, write, and execute permission (drwx------).
    8. Extra Options: Decide whether Cloudera Manager should clear existing data in ZooKeeper, standby NameNode, and JournalNodes. If the directories are not empty (for example, you are re-enabling a previous HA configuration), Cloudera Manager will not automatically delete the contents—you can select to delete the contents by keeping the default checkbox selection. The recommended default is to clear the directories. If you choose not to do so, the data should be in sync across the edits directories of the JournalNodes and should have the same version data as the NameNodes.
    9. Click Continue.

      Cloudera Manager executes a set of commands that stop the dependent services, delete, create, and configure roles and directories as required, create a nameservice and failover controller, restart the dependent services, and deploy the new client configuration.

  5. If you want to use other services in a cluster with HA configured, follow the procedures in Configuring Other CDH Components to Use HDFS HA.

Fencing Methods

To ensure that only one NameNode is active at a time, a fencing method is required for the shared edits directory. During a failover, the fencing method is responsible for ensuring that the previous active NameNode no longer has access to the shared edits directory, so that the new active NameNode can safely proceed writing to it.

By default, Cloudera Manager configures HDFS to use a shell fencing method (shell(true)).

The fencing parameters are found in the Service-Wide > High Availability category under the configuration properties for your HDFS service.

Fencing Configuration

dfs.ha.fencing.methods - a list of scripts or Java classes which will be used to fence the active NameNode during a failover

It is desirable for correctness of the system that only one NameNode be in the active state at any given time.

When you use Quorum-based Storage, only one NameNode will ever be allowed to write to the JournalNodes, so there is no potential for corrupting the file system metadata in a "split-brain" scenario. This is reflected in the default value of shell(true) for the dfs.ha.fencing.methods, which does not explicitly try to fence the standby NameNode.

In the absence of explicitly fencing, there is a narrow time window where the previously active NameNode may serve out-of-date responses to reads from clients. This window ends when the previously active NameNode tries to write to the JournalNodes, at which point the NameNode shuts down.

This window of stale read responses is rarely an issue for applications since there is no danger of split-brain corruption. In rare or special cases where strong read consistency is required, use an explicit fencing method such as the agent-based fencer.

The fencing methods used during a failover are configured as a carriage-return-separated list, and these will be attempted in order until one of them indicates that fencing has succeeded.

For information on implementing your own custom fencing method, see the org.apache.hadoop.ha.NodeFencer class.

Configuring the shell fencing method

shell - run an arbitrary shell command to fence the active NameNode

The shell fencing method runs an arbitrary shell command, which you can configure as shown below:

<property>
  <name>dfs.ha.fencing.methods</name>
  <value>shell(/path/to/my/script.sh arg1 arg2 ...)</value>
</property>

The string between '(' and ')' is passed directly to a bash shell and cannot include any closing parentheses.

When executed, the first argument to the configured script will be the address of the NameNode to be fenced, followed by all arguments specified in the configuration.

The shell command will be run with an environment set up to contain all of the current Hadoop configuration variables, with the '_' character replacing any '.' characters in the configuration keys. The configuration used has already had any NameNode-specific configurations promoted to their generic forms - for example dfs_namenode_rpc-address will contain the RPC address of the target node, even though the configuration may specify that variable as dfs.namenode.rpc-address.ns1.nn1.

The following variables referring to the target node to be fenced are also available:
Variable Description
$target_host Hostname of the node to be fenced
$target_port IPC port of the node to be fenced
$target_address The above two variables, combined as host:port
$target_nameserviceid The nameservice ID of the NameNode to be fenced
$target_namenodeid The NameNode ID of the NameNode to be fenced

You can also use these environment variables as substitutions in the shell command itself. For example:

<property>
  <name>dfs.ha.fencing.methods</name>
  <value>shell(/path/to/my/script.sh --nameservice=$target_nameserviceid $target_host:$target_port)</value>
</property>

If the shell command returns an exit code of 0, the fencing is determined to be successful. If it returns any other exit code, the fencing was not successful and the next fencing method in the list will be attempted.