Configuring Fault Tolerance

Configure and Deploy Automatic Failover

Configure automatic failover, initialize HA state in ZooKeeper, and start the nodes in the cluster.

  1. Configure automatic failover.
    • Set up your cluster for automatic failover. Add the dfs.ha.automatic-failover.enabled property, set to true, to the hdfs-site.xml file on all of the NameNode machines.

    • List the host-port pairs running the ZooKeeper service. Add the ha.zookeeper.quorum property to the core-site.xml file on all of the NameNode machines, with a comma-separated list of host:port pairs as its value.


      Suffix the configuration key with the nameservice ID to configure the above settings on a per-nameservice basis. For example, in a cluster with federation enabled, you can explicitly enable automatic failover for only one of the nameservices by setting dfs.ha.automatic-failover.enabled.$my-nameservice-id.
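
One way to confirm that a NameNode machine picked up these settings is to query its effective configuration. The sketch below is illustrative only: it assumes the hdfs client is on the PATH and that the two keys involved are dfs.ha.automatic-failover.enabled and ha.zookeeper.quorum. The function name check_failover_conf is hypothetical.

```shell
# Print the effective value of each failover-related key on this machine.
# `hdfs getconf -confKey` reads the value from the local configuration files.
check_failover_conf() {
  for key in dfs.ha.automatic-failover.enabled ha.zookeeper.quorum; do
    # Fall back to a placeholder if the lookup fails.
    value=$(hdfs getconf -confKey "$key" 2>/dev/null) || value="<not set>"
    printf '%s = %s\n' "$key" "$value"
  done
}
```

Running check_failover_conf on each NameNode machine and comparing the output is a quick way to spot a host whose files were not updated.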

  2. Initialize HA state in ZooKeeper.

    Execute the following command on NN1:

    hdfs zkfc -formatZK -force

    This command creates a znode in ZooKeeper. The automatic failover system uses this znode to store its data.

  3. Check to see if ZooKeeper is running. If not, start ZooKeeper by executing the following command on the ZooKeeper host machines.
    su - zookeeper -c "export ZOOCFGDIR=/usr/hdp/current/zookeeper-server/conf ; export ZOOCFG=zoo.cfg; source /usr/hdp/current/zookeeper-server/conf/ ; /usr/hdp/current/zookeeper-server/bin/ start"
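
A minimal sketch of the "is ZooKeeper running" check, assuming that nc (netcat) is installed, that the server listens on the default client port 2181, and that the four-letter-word command ruok is enabled (newer ZooKeeper releases require it to be listed in 4lw.commands.whitelist). The function name zk_is_running is hypothetical.

```shell
# Return 0 if the ZooKeeper server at host:port answers "imok" to "ruok".
zk_is_running() {
  host=$1
  port=${2:-2181}   # assumed default client port
  # -w 2 gives up after two seconds if the server does not respond.
  reply=$(printf 'ruok' | nc -w 2 "$host" "$port" 2>/dev/null)
  [ "$reply" = "imok" ]
}
```

For example, `zk_is_running zk1.example.com || echo "start ZooKeeper first"`, where zk1.example.com stands in for one of your ZooKeeper hosts.
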
  4. Start the JournalNodes, NameNodes, and DataNodes using the instructions provided in the Controlling HDP Services Manually chapter of the HDP Administration Guide.
  5. Start the ZooKeeper Failover Controller (ZKFC) by executing the following command:
    su -l hdfs -c "/usr/hdp/current/hadoop-hdfs-namenode/../hadoop/sbin/ start zkfc"

    The sequence of starting ZKFC determines which NameNode will become Active. For example, if ZKFC is started on NN1 first, it will cause NN1 to become Active.


    If you are converting a non-HA cluster to an HA cluster, Hortonworks recommends that you run the bootstrapStandby command (which initializes NN2) before you start ZKFC on any of the NameNode machines.

  6. Verify automatic failover.
    1. Locate the Active NameNode.

      Use the NameNode web UI to check the status for each NameNode host machine.

    2. Cause a failure on the Active NameNode host machine.

      For example, you can use the following command to simulate a JVM crash:

      kill -9 $PID_of_Active_NameNode

      Alternatively, you can power cycle the machine or unplug its network interface to simulate an outage.

    3. The Standby NameNode should now automatically become Active within several seconds.

      The amount of time required to detect a failure and trigger a failover depends on the configuration of the ha.zookeeper.session-timeout.ms property (default value is 5 seconds).

    4. If the test fails, your HA settings might be incorrectly configured.

      Check the logs for the zkfc daemons and the NameNode daemons to diagnose the issue.
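
The verification in step 6 can also be scripted instead of watching the web UI. The following is a sketch, not a definitive procedure: it relies on `hdfs haadmin -getServiceState`, which prints active or standby for a NameNode ID, and assumes the IDs you pass (nn1 and nn2 in the usage note) match the ones defined for your nameservice. The function names are hypothetical.

```shell
# Print the ID of the first NameNode that reports itself as active.
# Returns non-zero if none of the given IDs is active.
find_active_nn() {
  for id in "$@"; do
    state=$(hdfs haadmin -getServiceState "$id" 2>/dev/null)
    if [ "$state" = "active" ]; then
      printf '%s\n' "$id"
      return 0
    fi
  done
  return 1
}

# Poll once per second until a NameNode other than $1 is active,
# giving up after $2 attempts. Remaining arguments are the NameNode IDs.
wait_for_failover() {
  old_active=$1
  timeout=$2
  shift 2
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    if new_active=$(find_active_nn "$@") && [ "$new_active" != "$old_active" ]; then
      printf '%s\n' "$new_active"
      return 0
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  return 1
}
```

A typical run records `old=$(find_active_nn nn1 nn2)`, triggers the failure on that host, and then calls `wait_for_failover "$old" 30 nn1 nn2`. A non-zero exit status means that no other NameNode became Active in time, which is the cue to inspect the zkfc and NameNode logs.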