NameNode HA Alerts
| Alert | Description | Potential Causes | Possible Remedies |
|---|---|---|---|
| JournalNode process | This host-level alert is triggered if the individual JournalNode process cannot be confirmed to be up and listening on the network within the configured critical threshold, given in seconds. | The JournalNode process is down or not responding. The JournalNode is up but is not listening on the correct network port/address. | Check whether the JournalNode process is dead. |
| NameNode High Availability Health | This service-level alert is triggered if either the Active NameNode or the Standby NameNode is not running. | The Active NameNode, the Standby NameNode, or both processes are down. | On each host running NameNode, check for errors in the logs (/var/log/hadoop/hdfs/) and restart the NameNode host/process using Ambari Web. On each host running NameNode, run the netstat -tuplpn command to check whether the NameNode process is bound to the correct network port. |
| Percent JournalNodes Available | This service-level alert is triggered if the percentage of down JournalNodes in the cluster is greater than the configured threshold (33% warning, 50% critical). It aggregates the results of the JournalNode process checks. | JournalNodes are down. JournalNodes are up but are not listening on the correct network port/address. | Check for dead JournalNodes in Ambari Web. |
| ZooKeeper Failover Controller process | This alert is triggered if the ZooKeeper Failover Controller (ZKFC) process cannot be confirmed to be up and listening on the network. | The ZKFC process is down or not responding. | Check whether the ZKFC process is running. |
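
The two checks described in the table, "up and listening on the network" and the percentage aggregation with 33% warning and 50% critical thresholds, can be approximated by hand when Ambari Web is unavailable. The following is a minimal Python sketch, not Ambari's own implementation: the function names are hypothetical, the strict "greater than" comparison follows the wording of the alert description, and the host names and port in the usage note are assumptions that must be replaced with the values configured for your cluster.

```python
import socket


def is_listening(host, port, timeout_s=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout_s.

    This only shows that something accepts connections on the port; it does
    not prove that the listener is the expected Hadoop process.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def classify_percent_down(up_flags, warn_pct=33.0, crit_pct=50.0):
    """Aggregate per-node results (True = up) into OK, WARNING, or CRITICAL.

    Assumption: a threshold is breached only when the percentage of down
    nodes is strictly greater than it, as the alert description suggests.
    """
    if not up_flags:
        return "UNKNOWN"
    down = sum(1 for up in up_flags if not up)
    pct_down = 100.0 * down / len(up_flags)
    if pct_down > crit_pct:
        return "CRITICAL"
    if pct_down > warn_pct:
        return "WARNING"
    return "OK"


def journalnode_status(nodes, timeout_s=5.0):
    """Probe each (host, port) pair and aggregate the results."""
    return classify_percent_down(
        [is_listening(host, port, timeout_s) for host, port in nodes]
    )
```

For example, `journalnode_status([("jn1.example.com", 8485), ("jn2.example.com", 8485), ("jn3.example.com", 8485)])` probes three placeholder hosts on port 8485, which is assumed here to be the JournalNode RPC port. With one of three nodes unreachable, about 33.3% are down and the sketch reports WARNING. With two unreachable it reports CRITICAL.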