3.1. HDFS Service Alerts
Alert | Description | Potential Causes | Possible Remedies
---|---|---|---
NameNode Blocks Health | This service-level alert is triggered if the number of corrupt or missing blocks exceeds the configured critical threshold. | Some DataNodes are down and the only replicas of the missing blocks are on those DataNodes. The corrupt or missing blocks are from files with a replication factor of 1, so new replicas cannot be created because the only replica of each block is missing. | For critical data, use a replication factor of 3. Bring up the failed DataNodes that hold the missing or corrupt blocks. Identify the files associated with the missing or corrupt blocks by running the Hadoop fsck command (see the fsck sketch after this table). Delete the corrupt files and recover them from backup, if one exists.
NFS Gateway Process | This host-level alert is triggered if the NFS Gateway process cannot be confirmed to be up and listening on the network. | NFS Gateway is down. | Check for a dead NFS Gateway in Ambari Web.
DataNode Storage | This host-level alert is triggered if storage capacity is full on the DataNode (90% critical). It checks the DataNode JMX Servlet for the Capacity and Remaining properties. | Cluster storage is full. If cluster storage is not full, the DataNode is full. | If the cluster still has storage, use Balancer to distribute the data to relatively less-used DataNodes (see the Balancer sketch after this table). If the cluster is full, delete unnecessary data or add storage by adding either more DataNodes or more or larger disks to the DataNodes. After adding more storage, run Balancer.
DataNode Process | This host-level alert is triggered if the individual DataNode process cannot be established to be up and listening on the network for the configured critical threshold, given in seconds. | The DataNode process is down or not responding, or the DataNode is up but is not listening on the correct network port/address. | Check for dead DataNodes in Ambari Web. Check for any errors in the DataNode logs (/var/log/hadoop/hdfs) and restart the DataNode, if necessary. Run netstat -tulpn to check whether the DataNode process is bound to the correct network port (see the port-check sketch after this table).
DataNode Web UI | This host-level alert is triggered if the DataNode Web UI is unreachable. | The DataNode process is not running. | Check whether the DataNode process is running.
NameNode Host CPU Utilization | This host-level alert is triggered if CPU utilization of the NameNode exceeds certain thresholds (200% warning, 250% critical). It checks the NameNode JMX Servlet for the SystemCPULoad property. This information is only available if you are running JDK 1.7. | Unusually high CPU utilization. This can be caused by a very unusual job/query workload, but it is generally the sign of an issue in the daemon. | Use the top command to determine which processes are consuming excess CPU (see the CPU sketch after this table). Restart the offending process.
NameNode Web UI | This host-level alert is triggered if the NameNode Web UI is unreachable. | The NameNode process is not running. | Check whether the NameNode process is running.
Percent DataNodes with Available Space | This service-level alert is triggered if storage is full on a certain percentage of DataNodes (10% warn, 30% critical). | Cluster storage is full. If cluster storage is not full, a DataNode is full. | If the cluster still has storage, use Balancer to distribute the data to relatively less-used DataNodes. If the cluster is full, delete unnecessary data or add storage by adding either more DataNodes or more or larger disks to the DataNodes. After adding more storage, run Balancer.
Percent DataNodes Available | This alert is triggered if the number of down DataNodes in the cluster is greater than the configured critical threshold. It aggregates the DataNode Process alert. | DataNodes are down, or DataNodes are up but are not listening on the correct network port/address. | Check for dead DataNodes in Ambari Web. Check for any errors in the DataNode logs (/var/log/hadoop/hdfs) and restart the DataNode hosts/processes. Run netstat -tulpn to check whether the DataNode process is bound to the correct network port.
NameNode RPC Latency | This host-level alert is triggered if the NameNode operations RPC latency exceeds the configured critical threshold. Typically, an increase in RPC processing time increases the RPC queue length, causing the average queue wait time to increase for NameNode operations. | A job or an application is performing too many NameNode operations. | Review the job or the application for potential bugs causing it to perform too many NameNode operations.
NameNode Last Checkpoint | This alert triggers if the last NameNode checkpoint was too long ago or if the number of uncommitted transactions is beyond a certain threshold. | Too much time has elapsed since the last NameNode checkpoint. Uncommitted transactions are beyond the threshold. | Force a NameNode checkpoint (see the checkpoint sketch after this table). Review the threshold for uncommitted transactions.
Secondary NameNode Process | This host-level alert is triggered if the Secondary NameNode process cannot be confirmed to be up and listening on the network. This alert is not applicable when NameNode HA is configured. | The Secondary NameNode is not running. | Check that the Secondary NameNode process is running.
NameNode Directory Status | This alert checks if the NameNode NameDirStatus metric reports a failed directory. | One or more of the directories are reporting as not healthy. | Check the NameNode UI for information about unhealthy directories.
HDFS Capacity Utilization | This service-level alert is triggered if the HDFS capacity utilization exceeds the configured critical threshold (80% warn, 90% critical). It checks the NameNode JMX Servlet for the CapacityUsed and CapacityRemaining properties. | Cluster storage is full. | Delete unnecessary data. Archive unused data. Add more DataNodes. Add more or larger disks to the DataNodes. After adding more storage, run Balancer.
DataNode Health Summary | This service-level alert is triggered if there are unhealthy DataNodes. | A DataNode is in an unhealthy state. | Check the NameNode UI for the list of dead DataNodes.
HDFS Pending Deletion Blocks | This service-level alert is triggered if the number of blocks pending deletion in HDFS exceeds the configured warning and critical thresholds. It checks the NameNode JMX Servlet for the PendingDeletionBlocks property. | A large number of blocks are pending deletion. |
HDFS Upgrade Finalized State | This service-level alert is triggered if HDFS is not in the finalized state. | The HDFS upgrade is not finalized. | Finalize any upgrade you have in process (see the finalize sketch after this table).
DataNode Unmounted Data Dir | This host-level alert is triggered if one of the data directories on a host was previously on a mount point and became unmounted. | A data directory that was on a mount point has become unmounted and may now be writing to the root partition, which is undesirable. If the mount history file does not exist, the alert reports an error when a host has one or more mounted data directories as well as one or more unmounted data directories on the root partition. | Check the data directories to confirm they are mounted as expected (see the mount-check sketch after this table).
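
The sketches below expand on the commands named in the remedies above. They are illustrative rather than prescriptive: host names, ports, and paths are placeholders, and exact syntax can vary across Hadoop and HDP releases.

To identify and clean up files with missing or corrupt blocks (NameNode Blocks Health), a minimal fsck workflow looks roughly like this; run it as the HDFS superuser, and note that older releases use `hadoop fsck` in place of `hdfs fsck`:

```bash
# List only the files that currently have missing or corrupt blocks
hdfs fsck / -list-corruptfileblocks

# Inspect one affected file in detail: its blocks and their DataNode
# locations (/path/to/file is a placeholder)
hdfs fsck /path/to/file -files -blocks -locations

# After confirming a backup exists, remove the corrupt files so they
# can be restored cleanly from that backup
hdfs fsck / -delete
```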
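When the cluster as a whole still has free space but individual DataNodes are full (DataNode Storage, Percent DataNodes with Available Space, HDFS Capacity Utilization), Balancer redistributes blocks toward less-used DataNodes. A typical invocation, assuming the default threshold is acceptable:

```bash
# Move blocks until every DataNode's utilization is within 10 percentage
# points of the cluster average; smaller values balance more evenly
# but take longer to complete
hdfs balancer -threshold 10
```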
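To confirm that a DataNode process is bound to the expected port (DataNode Process, Percent DataNodes Available), the following works on most Linux hosts. Port 50010 is the HDP 2.x default data transfer port; check dfs.datanode.address for the value in your cluster:

```bash
# Show listening sockets with the owning process; run as root so the
# process names of daemons owned by other users are visible
netstat -tulpn | grep -w 50010

# Verify reachability from another host
# (dn1.example.com is a placeholder hostname)
nc -vz dn1.example.com 50010
```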
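To track down excess CPU on the NameNode host (NameNode Host CPU Utilization), start with top and then narrow to the NameNode JVM. The pgrep pattern below assumes the standard Hadoop launch scripts, which tag each daemon's command line with -Dproc_namenode:

```bash
# One-shot snapshot of the busiest processes
top -b -n 1 | head -20

# CPU and memory usage of the NameNode JVM specifically
ps -o pid,pcpu,pmem,etime,args -p "$(pgrep -d, -f proc_namenode)"

# The alert itself reads the CPU load attribute from the NameNode JMX
# servlet (nn.example.com is a placeholder; 50070 is the HDP 2.x default)
curl -s 'http://nn.example.com:50070/jmx?qry=java.lang:type=OperatingSystem'
```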
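To force a checkpoint manually (NameNode Last Checkpoint), one common sequence is to enter safe mode, save the namespace, and leave safe mode. This clears the symptom; if checkpoints keep failing, also investigate the Secondary or Standby NameNode:

```bash
# Block namespace modifications while the image is saved
hdfs dfsadmin -safemode enter

# Merge the edit log into a fresh fsimage (requires safe mode)
hdfs dfsadmin -saveNamespace

# Resume normal operation
hdfs dfsadmin -safemode leave
```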
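To clear the HDFS Upgrade Finalized State alert, finalize the pending upgrade once you are sure you will not need to roll back; finalizing is irreversible, so confirm cluster health first. Which command applies depends on how the upgrade was performed:

```bash
# For a rolling upgrade: check the pending state, then finalize it
hdfs dfsadmin -rollingUpgrade query
hdfs dfsadmin -rollingUpgrade finalize

# For a classic (non-rolling) upgrade
hdfs dfsadmin -finalizeUpgrade
```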
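To verify that each DataNode data directory still sits on its intended mount (DataNode Unmounted Data Dir), compare the configured directories against the live mount table. The paths below are placeholders; take the real list from dfs.datanode.data.dir in hdfs-site.xml. On Ambari-managed hosts, the mount history this alert consults is typically kept by the agent under /var/lib/ambari-agent/data:

```bash
# Print the mount point backing each configured data directory;
# a result of "/" means the directory is writing to the root partition
for d in /grid/0/hadoop/hdfs/data /grid/1/hadoop/hdfs/data; do
  printf '%s -> %s\n' "$d" "$(df -P "$d" | awk 'NR==2 {print $6}')"
done
```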