Configuring Fault Tolerance

Configuring the NameNode HA Cluster

First, add High Availability configurations to your HDFS configuration files. Start with the HDFS configuration files from the original NameNode in your HDP cluster, use them as the base, and add the properties described below to those files.

After you have added the configurations below, ensure that the same set of HDFS configuration files is propagated to all nodes in the HDP cluster. This ensures that all nodes and services can interact with the highly available NameNode.

Add the following configuration options to your hdfs-site.xml file:

  • dfs.nameservices

    Choose an arbitrary but logical name (for example, mycluster) as the value for the dfs.nameservices option. This name is used both for configuration and as the authority component of absolute HDFS paths in the cluster.

    <property>
     <name>dfs.nameservices</name>
     <value>mycluster</value>
     <description>Logical name for this new nameservice</description>
    </property>

    If you are also using HDFS Federation, this configuration setting should also include the list of other nameservices, HA or otherwise, as a comma-separated list.
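
    For example, if your cluster also included a second federated nameservice, the value might look like the following sketch (yourcluster is a hypothetical placeholder name):

    <property>
     <name>dfs.nameservices</name>
     <value>mycluster,yourcluster</value>
     <description>Comma-separated list of all nameservices in the
        cluster, HA or otherwise</description>
    </property>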

  • dfs.ha.namenodes.[$nameservice ID]

    Provide a list of comma-separated NameNode IDs. DataNodes use this property to determine all the NameNodes in the cluster.

    For example, for the nameservice ID mycluster and individual NameNode IDs nn1, nn2, and nn3, the value of this property is:

    <property>
     <name>dfs.ha.namenodes.mycluster</name>
     <value>nn1,nn2,nn3</value>
     <description>Unique identifiers for each NameNode in the 
        nameservice</description>
    </property>
    Note

    The minimum number of NameNodes for HA is two, but you can configure more. You should not exceed five NameNodes due to communication overhead. Three NameNodes are recommended.

  • dfs.namenode.rpc-address.[$nameservice ID].[$name node ID]

    Use this property to specify the fully-qualified RPC address for each NameNode to listen on.

    Continuing with the previous example, set the full address and IPC port of the NameNode process for the NameNode IDs above -- nn1, nn2, and nn3.

    Note that there will be three separate configuration options.

    <property>
     <name>dfs.namenode.rpc-address.mycluster.nn1</name>
     <value>machine1.example.com:8020</value>
    </property>
    
    <property>
     <name>dfs.namenode.rpc-address.mycluster.nn2</name>
     <value>machine2.example.com:8020</value>
    </property>
    
    <property>
      <name>dfs.namenode.rpc-address.mycluster.nn3</name>
      <value>machine3.example.com:8020</value>
    </property>
  • dfs.namenode.http-address.[$nameservice ID].[$name node ID]

    Use this property to specify the fully-qualified HTTP address for each NameNode to listen on.

    Set the addresses for the NameNodes' HTTP servers to listen on. For example:

    <property>
      <name>dfs.namenode.http-address.mycluster.nn1</name>
      <value>machine1.example.com:9870</value>
    </property>
    
    <property>
      <name>dfs.namenode.http-address.mycluster.nn2</name>
      <value>machine2.example.com:9870</value>
    </property>
    
    <property>
      <name>dfs.namenode.http-address.mycluster.nn3</name>
      <value>machine3.example.com:9870</value>
    </property>
    Note

    If you have Hadoop security features enabled, set the https-address for each NameNode.
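
    For example, assuming the default NameNode HTTPS port of 9871, the corresponding property for nn1 might look like the following (set the same property for nn2 and nn3 as well):

    <property>
      <name>dfs.namenode.https-address.mycluster.nn1</name>
      <value>machine1.example.com:9871</value>
    </property>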

  • dfs.namenode.shared.edits.dir

    Use this property to specify the URI that identifies a group of JournalNodes (JNs) where the NameNode will write/read edits.

    Configure the addresses of the JNs that provide the shared edits storage. The Active NameNode writes to this shared storage and the Standby NameNode reads from this location to stay up-to-date with all the file system changes.

    Although the URI lists several JournalNode addresses, you configure only one such URI for your cluster.

    The URI should be of the form:

    qjournal://host1:port1;host2:port2;host3:port3/journalId

    The Journal ID is a unique identifier for this nameservice, which allows a single set of JournalNodes to provide storage for multiple federated namesystems. You can reuse the nameservice ID for the journal identifier.

    For example, if the JournalNodes for a cluster were running on node1.example.com, node2.example.com, and node3.example.com, and the nameservice ID were mycluster, you would use the following value for this setting:

    <property>
     <name>dfs.namenode.shared.edits.dir</name>
     <value>qjournal://node1.example.com:8485;node2.example.com:8485;node3.example.com:8485/mycluster</value>
    </property>
    Note

    The default port for the JournalNode is 8485.

  • dfs.client.failover.proxy.provider.[$nameservice ID]

    This property specifies the Java class that HDFS clients use to contact the Active NameNode. The DFS client uses this Java class to determine which NameNode is currently Active and therefore which NameNode is currently serving client requests.

    Use the ConfiguredFailoverProxyProvider implementation if you are not using a custom implementation.

    For example:

    <property>
     <name>dfs.client.failover.proxy.provider.mycluster</name>
     <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
    </property>
  • dfs.ha.fencing.methods

    This property specifies a list of scripts or Java classes that will be used to fence the Active NameNode during a failover.

    For maintaining system correctness, it is important to have only one NameNode in the Active state at any given time. When using the Quorum Journal Manager, only one NameNode will ever be allowed to write to the JournalNodes, so there is no potential for corrupting the file system metadata from a split-brain scenario. However, when a failover occurs, it is still possible that the previous Active NameNode could serve out-of-date read requests to clients until it shuts down when it tries to write to the JournalNodes.

    For this reason, it is still recommended to configure some fencing methods even when using the Quorum Journal Manager. To improve the availability of the system in the event the fencing mechanisms fail, it is advisable to configure a fencing method which is guaranteed to return success as the last fencing method in the list. Note that if you choose to use no actual fencing methods, you must set some value for this setting, for example shell(/bin/true).

    The fencing methods used during a failover are configured as a carriage-return-separated list, which is attempted in order until one indicates that fencing has succeeded. The following two methods are packaged with Hadoop: shell and sshfence. For information on implementing a custom fencing method, see the org.apache.hadoop.ha.NodeFencer class.
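
    For example, a list that tries sshfence first and then falls back to a fencing method that always succeeds might look like the following sketch (the newline inside the value separates the two methods):

    <property>
     <name>dfs.ha.fencing.methods</name>
     <value>sshfence
        shell(/bin/true)</value>
    </property>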

    sshfence: SSH to the Active NameNode and kill the process.

    The sshfence option SSHes to the target node and uses fuser to kill the process listening on the service's TCP port. In order for this fencing option to work, it must be able to SSH to the target node without providing a passphrase. Ensure that you configure the dfs.ha.fencing.ssh.private-key-files option, which is a comma-separated list of SSH private key files.

    For example:

    <property>
     <name>dfs.ha.fencing.methods</name>
     <value>sshfence</value>
    </property>
    
    <property>
     <name>dfs.ha.fencing.ssh.private-key-files</name>
     <value>/home/exampleuser/.ssh/id_rsa</value>
    </property>

    Optionally, you can configure a non-standard username or port for the SSH connection. You can also configure a timeout, in milliseconds, after which this fencing method is considered to have failed. To configure a non-standard username or port and a timeout, use the following properties:

    <property>
     <name>dfs.ha.fencing.methods</name>
     <value>sshfence([[username][:port]])</value>
    </property>
    
    <property>
     <name>dfs.ha.fencing.ssh.connect-timeout</name>
     <value>30000</value>
    </property>

    shell: Run an arbitrary shell command to fence the Active NameNode.

    The shell fencing method runs an arbitrary shell command:

    <property>
     <name>dfs.ha.fencing.methods</name>
     <value>shell(/path/to/my/script.sh arg1 arg2 ...)</value>
    </property>

    The string between '(' and ')' is passed directly to a bash shell and may not include any closing parentheses.

    The shell command will be run with an environment set up to contain all of the current Hadoop configuration variables, with the '_' character replacing any '.' characters in the configuration keys. The configuration used has already had any namenode-specific configurations promoted to their generic forms -- for example dfs_namenode_rpc-address will contain the RPC address of the target node, even though the configuration may specify that variable as dfs.namenode.rpc-address.ns1.nn1.

    Additionally, the following variables (referring to the target node to be fenced) are also available:

    • $target_host: Hostname of the node to be fenced

    • $target_port: IPC port of the node to be fenced

    • $target_address: The combination of $target_host and $target_port as host:port

    • $target_nameserviceid: The nameservice ID of the NN to be fenced

    • $target_namenodeid: The namenode ID of the NN to be fenced

    These environment variables may also be used as substitutions in the shell command. For example:

    <property>
     <name>dfs.ha.fencing.methods</name>
     <value>shell(/path/to/my/script.sh --nameservice=$target_nameserviceid $target_host:$target_port)</value>
    </property>

    If the shell command returns an exit code of 0, the fencing is successful.

    Note

    This fencing method does not implement any timeout. If timeouts are necessary, they should be implemented in the shell script itself (for example, by forking a subshell to kill its parent in some number of seconds).
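
    As an alternative sketch, if the GNU coreutils timeout command is available on the host that runs the fencing command, you can impose a time limit without changing the script by wrapping the command in the configured value (the script path and arguments below are placeholders):

    <property>
     <name>dfs.ha.fencing.methods</name>
     <value>shell(timeout 30 /path/to/my/script.sh arg1 arg2)</value>
    </property>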

  • fs.defaultFS

    The default path prefix used by the Hadoop FS client. Optionally, you can configure Hadoop clients to use the new HA-enabled logical URI as their default path. For example, for the mycluster nameservice ID, this will be the value of the authority portion of all of your HDFS paths. Configure this property in the core-site.xml file:
    <property>
     <name>fs.defaultFS</name>
     <value>hdfs://mycluster</value>
    </property>
  • dfs.journalnode.edits.dir

    This is the absolute path on the JournalNode machines where the edits and other local state used by the JNs will be stored. You may use only a single path for this configuration. Redundancy for this data is provided either by running multiple separate JournalNodes or by configuring this directory on a locally-attached RAID array. For example:
    <property>
     <name>dfs.journalnode.edits.dir</name>
     <value>/path/to/journal/node/local/data</value>
    </property>
    Note

    Setting the hadoop.security.authorization property in the core-site.xml file to true without Kerberos enabled on a NameNode HA cluster can cause NameNode and NameNode HA failures. Therefore, set this property to true only when configuring HDP for Kerberos.
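
    For a cluster without Kerberos, leave the property at its default of false; it is shown explicitly in core-site.xml below only for illustration:

    <property>
     <name>hadoop.security.authorization</name>
     <value>false</value>
    </property>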