Managing and Monitoring a Cluster

NameNode high availability alerts

Descriptions, potential causes, and possible remedies for alerts related to NameNode high availability.

Table 1. NameNode HA Alerts
Alert: JournalNode Process
Alert Type: WEB
Description: This host-level alert is triggered if the individual JournalNode process cannot be confirmed to be up and listening on the network within the configured critical threshold, given in seconds.
Potential Causes:
- The JournalNode process is down or not responding.
- The JournalNode is running but is not listening on the correct network port/address.
Possible Remedies:
- Check if the JournalNode process is running.

Alert: NameNode High Availability Health
Alert Type: SCRIPT
Description: This service-level alert is triggered if either the Active NameNode or the Standby NameNode is not running.
Potential Causes:
- The Active, Standby, or both NameNode processes are down.
Possible Remedies:
- On each host running NameNode, check the logs in /var/log/hadoop/hdfs/ for errors and restart the NameNode host/process using Ambari Web.
- On each host running NameNode, run the netstat -tuplpn command to check if the NameNode process is bound to the correct network port.

Alert: Percent JournalNodes Available
Alert Type: AGGREGATE
Description: This service-level alert is triggered if the percentage of down JournalNodes in the cluster is greater than the configured threshold (33% warning, 50% critical). It aggregates the results of JournalNode process checks.
Potential Causes:
- JournalNodes are down.
- JournalNodes are running but are not listening on the correct network port/address.
Possible Remedies:
- Check for non-operating JournalNodes in Ambari Web.

Alert: ZooKeeper Failover Controller Process
Alert Type: PORT
Description: This alert is triggered if the ZooKeeper Failover Controller (ZKFC) process cannot be confirmed to be up and listening on the network.
Potential Causes:
- The ZKFC process is down or not responding.
Possible Remedies:
- Check if the ZKFC process is running.
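
The port-style checks and the aggregate JournalNode thresholds described above can be approximated outside Ambari for troubleshooting. The following is a minimal Python sketch, not Ambari's implementation: the function names `is_port_listening` and `classify_percent_down` are hypothetical, the strict greater-than comparison is assumed from the wording of the alert description, and any host names and port numbers you pass in must come from your own cluster configuration.

```python
import socket


def is_port_listening(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    This only mirrors the idea of a port check (process up and listening);
    it is not how Ambari itself performs the check.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


def classify_percent_down(down: int, total: int,
                          warn: float = 0.33, crit: float = 0.50) -> str:
    """Classify the fraction of down JournalNodes against the 33%/50% thresholds.

    Assumption: a state is raised only when the fraction is strictly
    greater than the threshold, following the alert description.
    """
    if total <= 0:
        raise ValueError("total must be a positive number of JournalNodes")
    if not 0 <= down <= total:
        raise ValueError("down must be between 0 and total")
    fraction_down = down / total
    if fraction_down > crit:
        return "CRITICAL"
    if fraction_down > warn:
        return "WARNING"
    return "OK"


# Example usage (host names and port are placeholders for your own values):
#   hosts = ["jn1.example.com", "jn2.example.com", "jn3.example.com"]
#   down = sum(not is_port_listening(h, journalnode_port) for h in hosts)
#   print(classify_percent_down(down, len(hosts)))
```

Under these assumptions, a three-JournalNode cluster with one node down is classified as a warning (about 33.3% is greater than 33%), and with two nodes down as critical.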