Managing and Monitoring a Cluster

HBase service alerts

Descriptions, potential causes, and possible remedies for alerts triggered by HBase.

Table 1. HBase Service Alerts
Percent RegionServers Available

Description: This service-level alert is triggered if the configured percentage of RegionServer processes cannot be determined to be up and listening on the network for the configured critical threshold. The default setting is 10% to produce a WARN alert and 30% to produce a CRITICAL alert. The alert aggregates the results of RegionServer process down checks; a minimal sketch of a similar aggregation appears after this entry.

Potential Causes:
- Misconfiguration or less-than-ideal configuration caused the RegionServers to crash.
- Cascading failures brought on by some workload caused the RegionServers to crash.
- The RegionServers shut themselves down because there were problems in the dependent services, ZooKeeper or HDFS.
- GC paused the RegionServer for too long and the RegionServers lost contact with ZooKeeper.

Possible Remedies:
- Check the dependent services to make sure they are operating correctly.
- Look at the RegionServer log files (usually /var/log/hbase/*.log) for further information.
- If the failure was associated with a particular workload, try to understand the workload better.
- Restart the RegionServers.
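
The alert description above mentions aggregating per-RegionServer process checks. As a rough illustration, the following minimal Python sketch probes each RegionServer's info port and reports the percentage that are unreachable, using the 10%/30% thresholds and the default info port (60030) quoted in this table; the host names are hypothetical, and your cluster's hosts, port, and thresholds may differ.

```python
import socket

# Hypothetical RegionServer hosts; replace with the hosts in your cluster.
REGIONSERVERS = ["rs1.example.com", "rs2.example.com", "rs3.example.com"]
INFO_PORT = 60030             # default RegionServer info port in this release line
WARN_PCT, CRIT_PCT = 10, 30   # thresholds quoted in the alert description

def is_listening(host, port, timeout=5):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

down = [h for h in REGIONSERVERS if not is_listening(h, INFO_PORT)]
pct_down = 100.0 * len(down) / len(REGIONSERVERS)

if pct_down >= CRIT_PCT:
    level = "CRITICAL"
elif pct_down >= WARN_PCT:
    level = "WARN"
else:
    level = "OK"
print("%s: %.0f%% of RegionServers unreachable (%s)"
      % (level, pct_down, ", ".join(down) or "none"))
```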

HBase Master Process

Description: This alert is triggered if the HBase Master processes cannot be confirmed to be up and listening on the network for the configured critical threshold, given in seconds.

Potential Causes:
- The HBase Master process is down.
- The HBase Master has shut itself down because there were problems in the dependent services, ZooKeeper or HDFS.

Possible Remedies:
- Check the dependent services (a simple ZooKeeper reachability check is sketched after this entry).
- Look at the Master log files (usually /var/log/hbase/*.log) for further information.
- Look at the configuration files in /etc/hbase/conf.
- Restart the Master.
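
One way to check the ZooKeeper dependency mentioned in the remedies is to send it the standard ruok four-letter-word command; a healthy server answers imok. The sketch below is a minimal illustration, assuming the default ZooKeeper client port (2181) and a hypothetical host name; it is not the check Ambari itself runs.

```python
import socket

ZK_HOST = "zk1.example.com"   # hypothetical ZooKeeper host; check each quorum member
ZK_PORT = 2181                # default ZooKeeper client port

def zookeeper_ok(host, port, timeout=5):
    """Send the 'ruok' four-letter word; a healthy server replies 'imok'."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"ruok")
            return sock.recv(16) == b"imok"
    except OSError:
        return False

print("ZooKeeper healthy" if zookeeper_ok(ZK_HOST, ZK_PORT) else "ZooKeeper not responding")
```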

HBase Master CPU Utilization

Description: This host-level alert is triggered if CPU utilization of the HBase Master exceeds certain thresholds (200% warning, 250% critical). It checks the HBase Master JMX servlet for the SystemCPULoad property; a minimal sketch of the same lookup appears after this entry. This information is only available if you are running JDK 1.7.

Potential Causes:
- Unusually high CPU utilization. This can be caused by a very unusual job or query workload, but is generally the sign of an issue in the daemon.

Possible Remedies:
- Use the top command to determine which processes are consuming excess CPU.
- Reset the offending process.
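
The SystemCPULoad lookup described above can be reproduced manually through the Master's JMX JSON servlet. The sketch below is only illustrative: it assumes the Master web UI is reachable on the default info port for this release line (60010; newer releases use 16010), that the property is exposed on the java.lang:type=OperatingSystem bean as SystemCpuLoad, and a hypothetical host name.

```python
import json
import urllib.request

# Hypothetical Master host; adjust the host and info port for your cluster.
MASTER_JMX_URL = "http://hbase-master.example.com:60010/jmx?qry=java.lang:type=OperatingSystem"

with urllib.request.urlopen(MASTER_JMX_URL, timeout=10) as resp:
    beans = json.loads(resp.read().decode("utf-8"))["beans"]

# The OperatingSystem bean reports SystemCpuLoad as a 0.0-1.0 fraction of host CPU.
cpu_load = beans[0].get("SystemCpuLoad")
if cpu_load is None:
    print("SystemCpuLoad not reported (requires JDK 1.7 or later)")
else:
    print("HBase Master host CPU load: %.1f%%" % (cpu_load * 100))
```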

RegionServers Health Summary

Description: This service-level alert is triggered if there are unhealthy RegionServers.

Potential Causes:
- The RegionServer process is down on the host.
- The RegionServer process is up and running but not listening on the correct network port (default 60030).

Possible Remedies:
- Check for dead RegionServers in Ambari Web.

HBase RegionServer Process

Description: This host-level alert is triggered if the RegionServer processes cannot be confirmed to be up and listening on the network for the configured critical threshold, given in seconds.

Potential Causes:
- The RegionServer process is down on the host.
- The RegionServer process is up and running but not listening on the correct network port (default 60030).

Possible Remedies:
- Check for any errors in the logs (/var/log/hbase/) and restart the RegionServer process using Ambari Web.
- Run the netstat -tulpn command to check whether the RegionServer process is bound to the correct network port (see the sketch after this entry).
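
To script the netstat check from the last remedy, the sketch below filters the listener table for the default RegionServer info port (60030). It assumes netstat is installed and that the default port is in use; adjust the port for your configuration, and run as root if you want the owning-process column populated.

```python
import subprocess

RS_PORT = 60030  # default RegionServer info port quoted in this entry

# List listening TCP/UDP sockets with the owning process.
output = subprocess.run(
    ["netstat", "-tulpn"],
    capture_output=True, text=True, check=False
).stdout

bound = [line for line in output.splitlines() if ":%d" % RS_PORT in line]
if bound:
    print("RegionServer port %d is bound:" % RS_PORT)
    print("\n".join(bound))
else:
    print("Nothing is listening on port %d" % RS_PORT)
```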