3.1. HDFS Service Alerts

Alert

Alert Type

Description

Potential Causes

Possible Remedies

NameNode Blocks Health

METRIC

This service-level alert is triggered if the number of corrupt or missing blocks exceeds the configured critical threshold.

Some DataNodes are down, and the only replicas of the missing blocks are on those DataNodes.

The corrupt/missing blocks are from files with a replication factor of 1. New replicas cannot be created because the only replica of the block is missing.

For critical data, use a replication factor of 3.

Bring up the failed DataNodes with missing or corrupt blocks.

Identify the files associated with the missing or corrupt blocks by running the Hadoop fsck command.

Delete the corrupt files and recover them from backup, if one exists.
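As a rough illustration of the fsck remedy above, the sketch below extracts the affected file paths from a corrupt-block listing such as the one printed by `hdfs fsck <path> -list-corruptfileblocks`. The line format it expects (a block ID beginning with `blk_`, a tab, then the file path) is an assumption about the listing; the function name and the sample text are hypothetical.

```python
import re

def parse_corrupt_file_blocks(fsck_output):
    """Return the sorted, de-duplicated file paths found in a corrupt-block listing.

    Assumption: each data line looks like "<block id><TAB><file path>" and
    block ids start with "blk_". Every other line is ignored.
    """
    paths = set()
    for line in fsck_output.splitlines():
        match = re.match(r"^(blk_\S+)\t(.+)$", line.strip())
        if match:
            paths.add(match.group(2))
    return sorted(paths)

# Hypothetical sample text, used only to exercise the parser.
sample = (
    "Connecting to namenode via http://nn.example.com:50070/fsck\n"
    "blk_1073741825\t/data/raw/events.log\n"
    "blk_1073741830\t/data/raw/events.log\n"
    "blk_1073741999\t/tmp/scratch/part-00000\n"
    "The filesystem under path '/' has 2 CORRUPT files\n"
)
print(parse_corrupt_file_blocks(sample))
```

The resulting list is the set of files to restore from backup after the corrupt copies are deleted.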

NFS Gateway Process

PORT

This host-level alert is triggered if the NFS Gateway process cannot be confirmed to be up and listening on the network.

NFS Gateway is down.

Check for dead NFS Gateway in Ambari Web.

DataNode Storage

METRIC

This host-level alert is triggered if storage capacity is full on the DataNode (90% critical). It checks the DataNode JMX Servlet for the Capacity and Remaining properties.

Cluster storage is full.

If cluster storage is not full, DataNode is full.

If the cluster still has storage, use Balancer to distribute the data to relatively less-used DataNodes.

If the cluster is full, delete unnecessary data or add storage by adding either more DataNodes or more or larger disks to the DataNodes. After adding more storage, run Balancer.
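The DataNode Storage check described above can be sketched as a simple percentage calculation over the Capacity and Remaining values. This is a minimal sketch, not the alert's actual implementation: the function name is hypothetical, and treating "at or above 90%" as critical is an assumption about how the boundary is handled.

```python
def datanode_storage_state(capacity_bytes, remaining_bytes, critical_percent=90.0):
    """Classify DataNode storage from its Capacity and Remaining values.

    Returns (state, used_percent). A non-positive capacity yields UNKNOWN
    because no meaningful percentage can be computed.
    """
    if capacity_bytes <= 0:
        return ("UNKNOWN", 0.0)
    used_percent = 100.0 * (capacity_bytes - remaining_bytes) / capacity_bytes
    # Assumption: the critical threshold is inclusive.
    state = "CRITICAL" if used_percent >= critical_percent else "OK"
    return (state, used_percent)

print(datanode_storage_state(1000, 50))   # 95% used
print(datanode_storage_state(1000, 500))  # 50% used
```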

DataNode Process

PORT

This host-level alert is triggered if the individual DataNode processes cannot be established to be up and listening on the network for the configured critical threshold, given in seconds.

DataNode process is down or not responding.

The DataNode is not down but is not listening on the correct network port/address.

Check for dead DataNodes in Ambari Web.

Check for any errors in the DataNode logs (/var/log/hadoop/hdfs) and restart the DataNode, if necessary.

Run the netstat -tuplpn command to check if the DataNode process is bound to the correct network port.
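A PORT-type check of this kind amounts to attempting a TCP connection to the process. The sketch below shows that idea with the Python standard library; the function name and timeout value are hypothetical, and the host and port to probe depend on your DataNode configuration.

```python
import socket

def is_port_listening(host, port, timeout_seconds=1.5):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False
```

For example, `is_port_listening("dn1.example.com", 50010)` would probe a DataNode on a hypothetical host, assuming 50010 is the data transfer port configured in your cluster.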

DataNode Web UI

WEB

This host-level alert is triggered if the DataNode Web UI is unreachable.

The DataNode process is not running.

Check whether the DataNode process is running.

NameNode Host CPU Utilization

METRIC

This host-level alert is triggered if CPU utilization of the NameNode exceeds certain thresholds (200% warning, 250% critical). It checks the NameNode JMX Servlet for the SystemCPULoad property. This information is only available if you are running JDK 1.7.

Unusually high CPU utilization: Can be caused by a very unusual job/query workload, but this is generally the sign of an issue in the daemon.

Use the top command to determine which processes are consuming excess CPU.

Reset the offending process.

NameNode Web UI

WEB

This host-level alert is triggered if the NameNode Web UI is unreachable.

The NameNode process is not running.

Check whether the NameNode process is running.

Percent DataNodes with Available Space

AGGREGATE

This service-level alert is triggered if the storage is full on a certain percentage of DataNodes (10% warn, 30% critical).

Cluster storage is full.

If cluster storage is not full, DataNode is full.

If the cluster still has storage, use Balancer to distribute the data to relatively less-used DataNodes.

If the cluster is full, delete unnecessary data or add storage by adding either more DataNodes or more or larger disks to the DataNodes. After adding more storage, run Balancer.
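An AGGREGATE alert like this one summarizes the per-host results of another alert. The following sketch assumes each DataNode reports a storage state string and that a DataNode counts as full when that state is CRITICAL; the function name and the inclusive threshold comparison are assumptions for illustration.

```python
def percent_datanodes_full_state(storage_states, warn_percent=10.0, critical_percent=30.0):
    """Aggregate per-DataNode storage states into one service-level state.

    storage_states: list of per-host state strings, e.g. "OK" or "CRITICAL".
    Returns (state, percent_full).
    """
    if not storage_states:
        return ("UNKNOWN", 0.0)
    full = sum(1 for state in storage_states if state == "CRITICAL")
    percent_full = 100.0 * full / len(storage_states)
    if percent_full >= critical_percent:
        return ("CRITICAL", percent_full)
    if percent_full >= warn_percent:
        return ("WARNING", percent_full)
    return ("OK", percent_full)

# One full DataNode out of ten reaches the 10% warning threshold.
print(percent_datanodes_full_state(["CRITICAL"] + ["OK"] * 9))
```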

Percent DataNodes Available

AGGREGATE

This alert is triggered if the number of down DataNodes in the cluster is greater than the configured critical threshold. This aggregates the DataNode process alert.

DataNodes are down.

DataNodes are not down but are not listening to the correct network port/address.

Check for dead DataNodes in Ambari Web.

Check for any errors in the DataNode logs (/var/log/hadoop/hdfs) and restart the DataNode hosts/processes.

Run the netstat -tuplpn command to check if the DataNode process is bound to the correct network port.

NameNode RPC Latency

METRIC

This host-level alert is triggered if the NameNode operations RPC latency exceeds the configured critical threshold. Typically an increase in the RPC processing time increases the RPC queue length, causing the average queue wait time to increase for NameNode operations.

A job or an application is performing too many NameNode operations.

Review the job or the application for potential bugs causing it to perform too many NameNode operations.

NameNode Last Checkpoint

SCRIPT

This alert will trigger if the last time that the NameNode performed a checkpoint was too long ago or if the number of uncommitted transactions is beyond a certain threshold.

Too much time elapsed since last NameNode checkpoint.

Uncommitted transactions beyond threshold.

Set NameNode checkpoint.

Review threshold for uncommitted transactions.
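The two trigger conditions for the NameNode Last Checkpoint alert can be expressed as independent checks. In the sketch below the function name, parameter names, and the threshold values in the example call are all hypothetical; the real thresholds come from your alert definition.

```python
def checkpoint_alert_reasons(seconds_since_checkpoint, uncommitted_txns,
                             max_age_seconds, max_uncommitted_txns):
    """Return the list of reasons the checkpoint alert would fire (empty if none)."""
    reasons = []
    if seconds_since_checkpoint > max_age_seconds:
        reasons.append("too much time elapsed since last checkpoint")
    if uncommitted_txns > max_uncommitted_txns:
        reasons.append("uncommitted transactions beyond threshold")
    return reasons

# Hypothetical thresholds: 6 hours and 1,000,000 transactions.
print(checkpoint_alert_reasons(30000, 1500000, 21600, 1000000))
```

Either condition alone is enough to trigger the alert, which is why the function collects reasons rather than returning a single flag.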

Secondary NameNode Process

WEB

This alert is triggered if the Secondary NameNode process cannot be confirmed to be up and listening on the network. This alert is not applicable when NameNode HA is configured.

The Secondary NameNode is not running.

Check that the Secondary NameNode process is running.

NameNode Directory Status

METRIC

This alert checks if the NameNode NameDirStatus metric reports a failed directory.

One or more of the directories are reporting as not healthy.

Check the NameNode UI for information about unhealthy directories.

HDFS Capacity Utilization

METRIC

This service-level alert is triggered if the HDFS capacity utilization exceeds the configured critical threshold (80% warn, 90% critical). It checks the NameNode JMX Servlet for the CapacityUsed and CapacityRemaining properties.

Cluster storage is full.

Delete unnecessary data.

Archive unused data.

Add more DataNodes.

Add more or larger disks to the DataNodes.

After adding more storage, run Balancer.
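To make the HDFS Capacity Utilization check concrete, the sketch below reads CapacityUsed and CapacityRemaining from a JMX-style JSON document and compares the result with the 80% and 90% thresholds. Several details are assumptions: that the payload has a top-level "beans" list, that utilization is computed as used / (used + remaining), and that thresholds are inclusive. The function name and the bean name in the sample are hypothetical.

```python
import json

def capacity_utilization_from_jmx(jmx_json, warn_percent=80.0, critical_percent=90.0):
    """Return (state, used_percent) from the first bean exposing both properties."""
    beans = json.loads(jmx_json).get("beans", [])
    for bean in beans:
        if "CapacityUsed" in bean and "CapacityRemaining" in bean:
            used = float(bean["CapacityUsed"])
            remaining = float(bean["CapacityRemaining"])
            total = used + remaining
            if total <= 0:
                return ("UNKNOWN", 0.0)
            # Assumption: utilization ignores space consumed by non-HDFS data.
            used_percent = 100.0 * used / total
            if used_percent >= critical_percent:
                return ("CRITICAL", used_percent)
            if used_percent >= warn_percent:
                return ("WARNING", used_percent)
            return ("OK", used_percent)
    return ("UNKNOWN", 0.0)

# Hypothetical payload for illustration only.
sample_jmx = json.dumps(
    {"beans": [{"name": "example-bean", "CapacityUsed": 850, "CapacityRemaining": 150}]}
)
print(capacity_utilization_from_jmx(sample_jmx))
```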

DataNode Health Summary

METRIC

This service-level alert is triggered if there are unhealthy DataNodes.

A DataNode is in an unhealthy state.

Check the NameNode UI for the list of dead DataNodes.

HDFS Pending Deletion Blocks

METRIC

This service-level alert is triggered if the number of blocks pending deletion in HDFS exceeds the configured warning and critical thresholds. It checks the NameNode JMX Servlet for the PendingDeletionBlock property.

A large number of blocks are pending deletion.

HDFS Upgrade Finalized State

SCRIPT

This service-level alert is triggered if HDFS is not in the finalized state.

The HDFS upgrade is not finalized.

Finalize any upgrade you have in process.

DataNode Unmounted Data Dir

SCRIPT

This host-level alert is triggered if one of the data directories on a host was previously on a mount point and became unmounted.

If the mount history file does not exist, then report an error if a host has one or more mounted data directories as well as one or more unmounted data directories on the root partition. This may indicate that a data directory is writing to the root partition, which is undesirable.

Check the data directories to confirm they are mounted as expected.
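The fallback rule for a missing mount history file (report an error when a host mixes mounted data directories with data directories on the root partition) can be sketched as follows. The input mapping from data directory to mount point is assumed to be gathered elsewhere, and the function name and sample paths are hypothetical.

```python
def dirs_suspected_on_root(dir_to_mount_point):
    """Return data directories on "/" when the host also has mounted data directories.

    dir_to_mount_point: dict mapping each data directory to the mount point
    it currently resides on. An empty result means no error is reported.
    """
    on_root = sorted(d for d, mount in dir_to_mount_point.items() if mount == "/")
    mounted = [d for d, mount in dir_to_mount_point.items() if mount != "/"]
    # Only a mix of both kinds is treated as suspicious.
    if on_root and mounted:
        return on_root
    return []

# /grid/1 appears to have lost its mount and now resolves to the root partition.
print(dirs_suspected_on_root({
    "/grid/0/hadoop/hdfs/data": "/grid/0",
    "/grid/1/hadoop/hdfs/data": "/",
}))
```

A host whose data directories are all on the root partition returns an empty list here, since the rule above only flags the mixed case.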

DataNode Heap Usage

METRIC

This host-level alert is triggered if heap usage goes past thresholds on the DataNode. It checks the DataNode JMX Servlet for the MemHeapUsedM and MemHeapMaxM properties. The threshold values are in percent.

NameNode Client RPC Queue Latency

SCRIPT

This service-level alert is triggered if the deviation of RPC queue latency on the client port has grown beyond the specified threshold within a given period. This alert will monitor Hourly and Daily periods.

NameNode Client RPC Processing Latency

SCRIPT

This service-level alert is triggered if the deviation of RPC processing latency on the client port has grown beyond the specified threshold within a given period. This alert will monitor Hourly and Daily periods.

NameNode Service RPC Queue Latency

SCRIPT

This service-level alert is triggered if the deviation of RPC queue latency on the DataNode port has grown beyond the specified threshold within a given period. This alert will monitor Hourly and Daily periods.

NameNode Service RPC Processing Latency

SCRIPT

This service-level alert is triggered if the deviation of RPC processing latency on the DataNode port has grown beyond the specified threshold within a given period. This alert will monitor Hourly and Daily periods.

HDFS Storage Capacity Usage

SCRIPT

This service-level alert is triggered if the increase in storage capacity usage deviation has grown beyond the specified threshold within a given period. This alert will monitor Daily and Weekly periods.

NameNode Heap Usage

SCRIPT

This service-level alert is triggered if the NameNode heap usage deviation has grown beyond the specified threshold within a given period. This alert will monitor Daily and Weekly periods.
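The deviation-based alerts above (RPC latency, storage capacity usage, and heap usage) do not spell out how deviation is measured. One plausible reading, shown here purely as an assumption, is the standard deviation of the samples collected over the period, expressed as a percentage of their mean. The function name and the sample values are hypothetical.

```python
import statistics

def deviation_percent(samples):
    """Return the population standard deviation as a percentage of the mean.

    Returns None when there are fewer than two samples or the mean is zero,
    since no meaningful ratio exists in those cases.
    """
    if len(samples) < 2:
        return None
    mean = statistics.mean(samples)
    if mean == 0:
        return None
    return 100.0 * statistics.pstdev(samples) / mean

# Steady metric values give a low deviation; spiky values give a high one.
print(deviation_percent([10, 10, 10, 10]))
print(deviation_percent([5, 15]))
```

Under this reading, the alert would compare the returned percentage with the warning and critical thresholds configured for each monitored period.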