3.2. NameNode HA Alerts

Alert

Description

Potential Causes

Possible Remedies

JournalNode process

This host-level alert is triggered if an individual JournalNode process cannot be confirmed to be up and listening on the network within the configured critical threshold, given in seconds.

The JournalNode process is down or not responding.

The JournalNode is running but is not listening on the correct network port/address.

Check if the JournalNode process is dead.

NameNode High Availability Health

This service-level alert is triggered if either the Active NameNode or the Standby NameNode is not running.

The Active, Standby or both NameNode processes are down.

On each host running NameNode, check for any errors in the logs (/var/log/hadoop/hdfs/) and restart the NameNode host/process using Ambari Web.

On each host running NameNode, run the netstat -tuplpn command to check if the NameNode process is bound to the correct network port.
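The netstat check above can also be scripted. The following is a minimal sketch, not part of Ambari: `listening_ports` is a hypothetical helper, and it assumes the usual Linux net-tools column layout (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State, PID/Program name). It also assumes the NameNode appears under the program name `java`. Adjust both assumptions to match the output on your hosts.

```python
from typing import Set


def listening_ports(netstat_output: str, program: str = "java") -> Set[int]:
    """Return the TCP ports in LISTEN state owned by `program`.

    `netstat_output` is the captured text of `netstat -tuplpn`.
    The column layout is an assumption; verify it on your system.
    """
    ports: Set[int] = set()
    for line in netstat_output.splitlines():
        fields = line.split()
        # A listening TCP row has at least 7 whitespace-separated fields.
        # Header rows and UDP rows (which have no State column) are skipped.
        if len(fields) < 7 or not fields[0].startswith("tcp"):
            continue
        if fields[5] != "LISTEN":
            continue
        pid_program = fields[6]
        if "/" not in pid_program:
            continue
        if pid_program.split("/", 1)[1] != program:
            continue
        # The local address looks like 0.0.0.0:PORT or :::PORT.
        port_text = fields[3].rsplit(":", 1)[-1]
        if port_text.isdigit():
            ports.add(int(port_text))
    return ports
```

Compare the returned set with the NameNode ports configured for your cluster. If the expected port is missing, the process is either down or bound to a different address.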

Percent JournalNodes Available

This service-level alert is triggered if the percentage of down JournalNodes in the cluster is greater than the configured threshold (33% warning, 50% critical). It aggregates the results of the JournalNode process checks.

JournalNodes are down.

JournalNodes are running but are not listening on the correct network port/address.

Check for dead JournalNodes in Ambari Web.
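As an illustration of how the aggregate threshold behaves, the sketch below classifies a cluster by the percentage of JournalNodes that are down. It is not Ambari's implementation: `journalnode_alert_state` is a hypothetical function, the state names are illustrative, and the strictly-greater-than comparison is an assumption based on the wording of the alert description.

```python
def journalnode_alert_state(
    down: int,
    total: int,
    warn_pct: float = 33.0,
    crit_pct: float = 50.0,
) -> str:
    """Classify the cluster by the percentage of JournalNodes that are down.

    The default thresholds mirror the 33% warning and 50% critical values
    in the alert description. Strict '>' comparison is an assumption.
    """
    if total <= 0:
        raise ValueError("total must be a positive number of JournalNodes")
    if not 0 <= down <= total:
        raise ValueError("down must be between 0 and total")

    pct_down = 100.0 * down / total
    if pct_down > crit_pct:
        return "CRITICAL"
    if pct_down > warn_pct:
        return "WARNING"
    return "OK"
```

Under these assumptions, one of three JournalNodes being down (about 33.3%) exceeds the warning threshold, and two of three (about 66.7%) exceeds the critical threshold.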

ZooKeeper Failover Controller process

This alert is triggered if the ZooKeeper Failover Controller process cannot be confirmed to be up and listening on the network.

The ZKFC process is down or not responding.

Check if the ZKFC process is running.
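Several alerts in this section depend on whether a process is "up and listening on the network". A minimal way to test that condition by hand is to attempt a TCP connection with a timeout. The sketch below is an assumption about how such a check could be written, not the code Ambari runs, and `is_listening` is a hypothetical helper. The host names and port numbers in the usage comment are placeholders. Substitute the JournalNode and ZKFC addresses configured for your cluster.

```python
import socket


def is_listening(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


# Example usage (placeholder hosts and ports, replace with your own):
#   is_listening("journalnode-host.example.com", 8485)
#   is_listening("namenode-host.example.com", 8019)
```

A result of False means only that the port could not be reached from the machine running the check. Firewalls or a wrong address produce the same result as a dead process, so confirm on the host itself before restarting anything.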
