Use the following tests to validate successful failover of the NameNode master daemon service:
This test simulates the NameNode master daemon dying on the primary (current) machine. A successful failover should restart the NameNode process on the primary machine.
From the primary machine, run the following command to shut down the NameNode:
su - hdfs -c "/usr/lib/hadoop/bin/hadoop-daemon.sh --config $HADOOP_CONF_DIR stop namenode"
Once the NameNode is shut down, the monitor must detect that it is unavailable and attempt to restart it. This process takes approximately 30 seconds.
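The wait for the restart can be scripted rather than timed by hand. The following is a minimal sketch, not part of the documented procedure: wait_for is a hypothetical helper that polls an arbitrary check command, and the 60-second timeout and 5-second interval in the usage comment are assumptions chosen to leave margin over the roughly 30 seconds quoted above. The check shown relies on pgrep matching the NameNode main class on the Java command line.

```shell
# wait_for TIMEOUT_S INTERVAL_S COMMAND [ARGS...]
# Hypothetical helper: re-runs COMMAND every INTERVAL_S seconds until it
# succeeds or TIMEOUT_S seconds have elapsed. Returns 0 on success, 1 on timeout.
wait_for() {
  timeout_s=$1
  interval_s=$2
  shift 2
  elapsed=0
  while [ "$elapsed" -lt "$timeout_s" ]; do
    if "$@" >/dev/null 2>&1; then
      echo "up after ${elapsed}s"
      return 0
    fi
    sleep "$interval_s"
    elapsed=$((elapsed + interval_s))
  done
  echo "not up after ${timeout_s}s"
  return 1
}

# Example (run on the primary machine after stopping the NameNode):
#   wait_for 60 5 pgrep -f 'namenode\.NameNode'
```

A non-zero exit status means no NameNode process appeared within the timeout, which suggests the monitor did not restart it.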
This test simulates the NameNode server becoming unavailable due to a planned reboot of the primary NameNode machine. A successful failover should relocate both the floating IP and the NameNode process to the secondary machine in the cluster. When the primary machine completes the restart, the NameNode process should be relocated back to the primary machine.
Perform a soft reboot of the primary NameNode machine.
During the shutdown process, the other machine in the cluster will be notified and will begin the migration of the service.
The service should be available on the secondary machine in the cluster until the primary machine has completed its reboot.
This test simulates the NameNode server becoming unavailable due to an unplanned reboot of the primary NameNode machine. A successful failover should relocate both the floating IP and the NameNode process to the secondary machine in the cluster. When the primary machine completes the restart, the NameNode process should be relocated back to the primary machine.
Perform a hard shutdown of the primary NameNode machine.
Once the secondary machine in the cluster detects the outage, it must start the NameNode process locally.
The NameNode service should be available on the secondary machine until the primary machine completes its startup routine.
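For either reboot test, service availability can be observed from a client machine while the failover is in progress. The sketch below is illustrative only: measure_outage is a hypothetical helper that runs a probe command a fixed number of times and reports the longest run of consecutive failures. The usage comment assumes that the client's Hadoop configuration addresses the NameNode through the floating IP, so that a successful directory listing indicates the service is reachable wherever it is currently running.

```shell
# measure_outage PROBES INTERVAL_S COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND PROBES times, INTERVAL_S seconds apart,
# and prints the longest run of consecutive failed probes.
measure_outage() {
  probes=$1
  interval_s=$2
  shift 2
  i=0
  run=0
  longest=0
  while [ "$i" -lt "$probes" ]; do
    if "$@" >/dev/null 2>&1; then
      run=0                      # service answered; reset the failure run
    else
      run=$((run + 1))
      if [ "$run" -gt "$longest" ]; then
        longest=$run
      fi
    fi
    i=$((i + 1))
    sleep "$interval_s"
  done
  echo "longest failed run: ${longest} probe(s)"
}

# Example (start on a client machine just before rebooting the primary):
#   measure_outage 120 5 hadoop fs -ls /
```

Multiplying the reported run length by the probe interval gives a rough lower bound on the outage; the client's own connection retries can make individual probes take longer than the interval.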