5.5.1. HBase master process down alert

This alert is triggered if the HBase master processes cannot be confirmed to be up and listening on the network for the configured critical threshold, given in seconds. It uses the Nagios check_tcp plugin.
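
The kind of check described above can be approximated by hand when verifying an alert. The following is a minimal sketch in Python, not the check_tcp plugin itself: it only tests whether a TCP connection can be opened within a timeout. The function name tcp_port_is_listening and its parameters are illustrative assumptions; the host and port to probe depend on your cluster (the master port is set by hbase.master.port in hbase-site.xml).

```python
import socket


def tcp_port_is_listening(host: str, port: int, timeout_s: float = 10.0) -> bool:
    """Return True if a TCP connection to host:port opens within timeout_s.

    This is an illustrative stand-in for a TCP liveness probe, not the
    Nagios check_tcp plugin.
    """
    try:
        # create_connection resolves the host name and attempts the connect;
        # the context manager closes the socket again immediately.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False
```

Calling tcp_port_is_listening with the HBase master host name and port from the Nagios server reproduces roughly what the alert sees. A False result does not distinguish a stopped process from a network problem, so the causes listed below still need to be checked individually.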

5.5.1.1. Potential causes

The HBase master process is down
The HBase master has shut itself down because there were problems in the dependent services, ZooKeeper or HDFS
The Nagios server cannot connect to the HBase master through the network

5.5.1.2. Possible remedies

Check the dependent services (ZooKeeper and HDFS) to make sure they are operating correctly
Look at the master log files (usually /var/log/hbase/*.log) for further information
Look at the configuration files (/etc/hbase/conf)
Use ping to check the network connection between the Nagios server and the HBase master
Restart the master
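
When looking through the log files, it can help to pull out only the lines logged at ERROR or FATAL level. The sketch below is one possible way to do that in Python. It assumes the logs use the common log4j layout in which the level appears as a standalone word; the function name find_problem_lines and the max_lines parameter are hypothetical, and the glob pattern should be adjusted to wherever your logs actually live.

```python
import glob
import re
from typing import List

# Assumption: the log level appears as a standalone word on each line,
# as in the usual log4j layout.
LEVEL_PATTERN = re.compile(r"\b(ERROR|FATAL)\b")


def find_problem_lines(log_glob: str, max_lines: int = 50) -> List[str]:
    """Return up to max_lines of the last ERROR/FATAL lines matching log_glob."""
    hits: List[str] = []
    for path in sorted(glob.glob(log_glob)):
        # errors="replace" avoids failing on stray non-UTF-8 bytes in a log.
        with open(path, errors="replace") as log_file:
            for line in log_file:
                if LEVEL_PATTERN.search(line):
                    hits.append(f"{path}: {line.rstrip()}")
    # Keep only the most recent matches (files are read in sorted name order).
    return hits[-max_lines:] if max_lines > 0 else []
```

For example, find_problem_lines("/var/log/hbase/*.log") returns the matching lines prefixed with the file they came from. The lines just before a FATAL entry usually explain why the process shut down, so treat the output as a starting point for reading the full log.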

5.5.1.3. RegionServer process down alert

This alert is triggered if the RegionServer processes cannot be confirmed to be up and listening on the network for the configured critical threshold, given in seconds. It uses the Nagios check_tcp plugin.

5.5.1.3.1. Potential causes

Misconfiguration or less-than-ideal configuration has caused the RegionServers to crash
Cascading failures brought on by some workload have caused the RegionServers to crash
The RegionServers have shut themselves down because there were problems in the dependent services, ZooKeeper or HDFS
A garbage collection (GC) pause stopped the RegionServers for too long and they lost contact with ZooKeeper

5.5.1.3.2. Possible remedies

Check the dependent services to make sure they are operating correctly
Look at the RegionServer log files (usually /var/log/hbase/*.log) for further information
Look at the configuration files (/etc/hbase/conf)
If the failure was associated with a particular workload, try to understand the workload better
Restart the RegionServers

5.5.1.4. HBase percent region servers down alert

This alert is triggered if the configured percentage of RegionServer processes cannot be determined to be up and listening on the network for the configured critical threshold. The default setting is 10% to produce a WARN alert and 30% to produce a CRITICAL alert. It uses the check_aggregate plugin to aggregate the results of RegionServer process down alert checks.
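
The aggregation can be illustrated with a short Python sketch. This is not the check_aggregate plugin; it only shows how per-RegionServer results map onto the default 10% and 30% thresholds described above. The function name aggregate_status is hypothetical, and treating the thresholds as inclusive (exactly 10% down yields WARN) is an assumption, as is returning UNKNOWN when there are no results.

```python
from typing import Sequence


def aggregate_status(
    region_server_up: Sequence[bool],
    warn_pct: float = 10.0,
    crit_pct: float = 30.0,
) -> str:
    """Map per-RegionServer up/down results to OK, WARN, or CRITICAL.

    region_server_up holds one boolean per RegionServer: True if its
    process check succeeded, False if it failed.
    """
    if not region_server_up:
        # Assumption: with no results there is nothing to aggregate.
        return "UNKNOWN"
    down = sum(1 for up in region_server_up if not up)
    pct_down = 100.0 * down / len(region_server_up)
    # Assumption: thresholds are inclusive.
    if pct_down >= crit_pct:
        return "CRITICAL"
    if pct_down >= warn_pct:
        return "WARN"
    return "OK"
```

With ten RegionServers and the default thresholds, one failed check gives 10% down (WARN under the inclusive assumption) and three failed checks give 30% down (CRITICAL).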

5.5.1.4.1. Potential causes

Misconfiguration or less-than-ideal configuration caused the RegionServers to crash
Cascading failures brought on by some workload caused the RegionServers to crash
The RegionServers shut themselves down because there were problems in the dependent services, ZooKeeper or HDFS
A garbage collection (GC) pause stopped the RegionServers for too long and they lost contact with ZooKeeper

5.5.1.4.2. Possible remedies

Check the dependent services to make sure they are operating correctly
Look at the RegionServer log files (usually /var/log/hbase/*.log) for further information
Look at the configuration files (/etc/hbase/conf)
If the failure was associated with a particular workload, try to understand the workload better
Restart the RegionServers
