NameNode high availability alerts
Descriptions, potential causes, and possible remedies for alerts related to NameNode high availability.
Alert | Alert Type | Description | Potential Causes | Possible Remedies |
---|---|---|---|---|
JournalNode Process | WEB | This host-level alert is triggered if the individual JournalNode process cannot be confirmed to be up and listening on the network within the configured critical threshold, given in seconds. | The JournalNode process is down or not responding. The JournalNode is running but is not listening on the correct network port/address. | Check if the JournalNode process is running. |
NameNode High Availability Health | SCRIPT | This service-level alert is triggered if either the Active NameNode or the Standby NameNode is not running. | The Active NameNode process, the Standby NameNode process, or both are down. | On each host running NameNode, check for any errors in the logs in /var/log/hadoop/hdfs/ and restart the NameNode host/process using Ambari Web. On each host running NameNode, run the netstat -tuplpn command to check if the NameNode process is bound to the correct network port. |
Percent JournalNodes Available | AGGREGATE | This service-level alert is triggered if the percentage of down JournalNodes in the cluster exceeds the configured threshold (33% warning, 50% critical). It aggregates the results of JournalNode process checks. | JournalNodes are down. JournalNodes are running but are not listening on the correct network port/address. | Check for non-operating JournalNodes in Ambari Web. |
ZooKeeper Failover Controller Process | PORT | This alert is triggered if the ZooKeeper Failover Controller process cannot be confirmed to be up and listening on the network. | The ZKFC process is down or not responding. | Check if the ZKFC process is running. |
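
The port-based and aggregate checks described in the table can be approximated by hand when troubleshooting. The following is a minimal sketch, not Ambari's own implementation: the function names `is_listening` and `aggregate_state` are hypothetical, the strict "greater than" comparison follows the wording of the Percent JournalNodes Available description, and the 33% and 50% defaults are taken from the table.

```python
import socket


def is_listening(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    This approximates a PORT-style check: the process counts as up only if
    something accepts connections on the expected address.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False


def aggregate_state(down: int, total: int,
                    warn: float = 33.0, crit: float = 50.0) -> str:
    """Classify the share of down nodes against warning/critical thresholds.

    Assumption: a state is raised only when the percentage is strictly
    greater than the threshold, as the alert description is worded.
    """
    if total <= 0:
        return "UNKNOWN"
    percent_down = 100.0 * down / total
    if percent_down > crit:
        return "CRITICAL"
    if percent_down > warn:
        return "WARNING"
    return "OK"
```

For example, `is_listening("jn1.example.com", 8485)` would probe a JournalNode on a placeholder host name, assuming the cluster uses the default JournalNode RPC port; substitute the address and port configured for your cluster. With three JournalNodes, `aggregate_state(1, 3)` yields a warning and `aggregate_state(2, 3)` yields a critical state under these assumptions.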