9. Validate Cluster Failover

Use the following tests to validate successful failover of the NameNode master daemon service:

 9.1. Validate NameNode restart on primary machine

This test simulates the NameNode master daemon dying on the primary (current) machine. In a successful test, the NameNode process is restarted on the primary machine.

  • From the primary machine, run the following commands to shut down the NameNode:

    su - hdfs
    /usr/lib/hadoop/bin/hadoop-daemon.sh --config $HADOOP_CONF_DIR stop namenode
            
  • Once the NameNode is shut down, the monitor should detect that it is unavailable and attempt to restart it. This process takes approximately 30 seconds.
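
The restart check in this test can be scripted. The following is a minimal sketch and not part of the documented procedure: wait_until is a hypothetical helper that repeats a command until it succeeds or a timeout elapses. The timeout and polling interval are assumptions and should be chosen to leave headroom over the approximately 30 seconds quoted above.

```shell
# wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Hypothetical helper: re-runs COMMAND every INTERVAL_SECONDS until it
# exits with status 0, or gives up once TIMEOUT_SECONDS have elapsed.
wait_until() {
    timeout=$1
    interval=$2
    shift 2
    elapsed=0
    while [ "$elapsed" -le "$timeout" ]; do
        # Discard the command's own output; only its exit status matters.
        if "$@" >/dev/null 2>&1; then
            echo "check passed after ${elapsed}s"
            return 0
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    echo "check still failing after ${timeout}s" >&2
    return 1
}
```

For example, assuming pgrep is installed on the primary machine, the following waits up to 60 seconds for a Java process running the NameNode main class to reappear:

    wait_until 60 5 pgrep -f org.apache.hadoop.hdfs.server.namenode.NameNode

A running process does not prove that the NameNode is serving requests, so treat this only as a first-pass check.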

 9.2. Validate NameNode failover during soft reboot

This test simulates the NameNode server becoming unavailable due to a planned reboot of the primary NameNode machine. A successful failover relocates both the floating IP and the NameNode process to the secondary machine in the cluster. When the primary machine completes the restart, the NameNode process should be relocated back to the primary machine.

  • Perform a soft reboot of the primary NameNode machine.

  • During the shutdown process, the other machine in the cluster is notified and begins migrating the service.

  • The service should be available on the secondary machine in the cluster until the primary machine has completed its reboot.
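
To see when the service moves and how long it is unreachable, a probe can be run from a third machine while the reboot is in progress. The sketch below is illustrative only: watch_service is a hypothetical helper that runs a probe command repeatedly and prints a timestamped line each time the result changes between UP and DOWN.

```shell
# watch_service COUNT INTERVAL_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND up to COUNT times, INTERVAL_SECONDS
# apart, and prints "HH:MM:SS UP" or "HH:MM:SS DOWN" whenever the
# probe result differs from the previous one.
watch_service() {
    count=$1
    interval=$2
    shift 2
    last=""
    i=0
    while [ "$i" -lt "$count" ]; do
        if "$@" >/dev/null 2>&1; then
            state=UP
        else
            state=DOWN
        fi
        # Report only transitions, so the output stays short.
        if [ "$state" != "$last" ]; then
            echo "$(date '+%H:%M:%S') $state"
            last=$state
        fi
        i=$((i + 1))
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
    return 0
}
```

A possible invocation, assuming a netcat build that supports -z, a placeholder floating address FLOATING_IP, and a NameNode RPC port of 8020 (check fs.default.name in your configuration for the actual port):

    watch_service 300 1 nc -z "$FLOATING_IP" 8020

The gap between a DOWN line and the next UP line approximates the failover time as seen by clients.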

 9.3. Validate NameNode failover during hard reboot

This test simulates the NameNode server becoming unavailable due to an unplanned reboot of the primary NameNode machine. A successful failover relocates both the floating IP and the NameNode process to the secondary machine in the cluster. When the primary machine completes the restart, the NameNode process should be relocated back to the primary machine.

  • Perform a hard shutdown of the primary NameNode machine.

  • Once the secondary machine in the cluster detects the outage, it must start the NameNode process on the secondary machine.

  • The NameNode service should be available on the secondary machine until the primary machine completes its startup routine.
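
One way to confirm on the secondary machine that the floating IP has moved is to look for it among the locally configured addresses. This is a sketch under assumptions: has_address is a hypothetical helper, and it expects the one-line-per-address output of `ip -o -4 addr show` on standard input, so it applies only to Linux hosts with the iproute2 tools.

```shell
# has_address IPV4_ADDRESS
# Hypothetical helper: reads `ip -o -4 addr show` output on stdin and
# succeeds if IPV4_ADDRESS is configured on any interface.
has_address() {
    # Each line contains "inet ADDRESS/PREFIX"; match the address
    # literally, bounded by "inet " and "/" to avoid partial matches.
    grep -F -q "inet $1/"
}
```

For example, with FLOATING_IP set to the cluster's floating address (a placeholder here):

    ip -o -4 addr show | has_address "$FLOATING_IP" && echo "floating IP is on this machine"

This shows only where the address lives. Whether the NameNode process has also started on the secondary machine needs a separate check.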

