NameNode HA Alerts
Alert | Alert Type | Description | Potential Causes | Possible Remedies |
---|---|---|---|---|
JournalNode Process | WEB | This host-level alert is triggered if the individual JournalNode process cannot be confirmed to be up and listening on the network within the configured critical threshold, given in seconds. | The JournalNode process is down or not responding. The JournalNode is running but is not listening on the correct network port/address. | Check if the JournalNode process is running. |
NameNode High Availability Health | SCRIPT | This service-level alert is triggered if either the Active NameNode or the Standby NameNode is not running. | The Active NameNode process, the Standby NameNode process, or both are down. | On each host running NameNode, check for errors in the logs (/var/log/hadoop/hdfs/) and restart the NameNode host/process using Ambari Web. On each host running NameNode, run the `netstat -tuplpn` command to check whether the NameNode process is bound to the correct network port. |
Percent JournalNodes Available | AGGREGATE | This service-level alert is triggered if the percentage of down JournalNodes in the cluster is greater than the configured threshold (33% warning, 50% critical). It aggregates the results of the JournalNode process checks. | JournalNodes are down. JournalNodes are running but are not listening on the correct network port/address. | Check for non-operating JournalNodes in Ambari Web. |
ZooKeeper Failover Controller Process | PORT | This alert is triggered if the ZooKeeper Failover Controller (ZKFC) process cannot be confirmed to be up and listening on the network. | The ZKFC process is down or not responding. | Check if the ZKFC process is running. |