Alert | Description | Potential Causes | Possible Remedies
---|---|---|---|
NameNode Blocks health | This service-level alert is triggered if the number of corrupt or missing blocks exceeds the configured critical threshold. | Some DataNodes are down and the replicas that are missing blocks are only on those DataNodes. The corrupt or missing blocks are from files with a replication factor of 1, so new replicas cannot be created because the only replica of each block is missing. | For critical data, use a replication factor of 3. Bring up the failed DataNodes that hold the missing or corrupt blocks. Identify the files associated with the missing or corrupt blocks by running the hadoop fsck command. Delete the corrupt files and recover them from backup, if one exists.
NameNode process | This host-level alert is triggered if the NameNode process cannot be confirmed to be up and listening on the network for the configured critical threshold, given in seconds. | The NameNode process is down on the HDFS master host, or it is up and running but not listening on the correct network port (default 8201). | Check for errors in the logs (/var/log/hadoop/hdfs/) and restart the NameNode host/process using the HMC Manage Services tab. Run the netstat -tulpn command to check whether the NameNode process is bound to the correct network port.
DataNode storage | This host-level alert is triggered if storage capacity is full on the DataNode (90% critical). It checks the DataNode JMX Servlet for the Capacity and Remaining properties. | Cluster storage is full. If cluster storage is not full, an individual DataNode may be full. | If the cluster still has storage, use the Balancer to distribute the data to relatively less-used DataNodes. If the cluster is full, delete unnecessary data or add storage by adding either more DataNodes or more or larger disks to the DataNodes. After adding more storage, run the Balancer.
DataNode process | This host-level alert is triggered if the individual DataNode processes cannot be established to be up and listening on the network for the configured critical threshold, given in seconds. | The DataNode process is down or not responding, or it is running but not listening on the correct network port/address. | Check for dead DataNodes in Ambari Web. Check for errors in the DataNode logs (/var/log/hadoop/hdfs) and restart the DataNode, if necessary. Run the netstat -tulpn command to check whether the DataNode process is bound to the correct network port.
DataNode Web UI | This host-level alert is triggered if the DataNode Web UI is unreachable. | The DataNode process is not running. | Check whether the DataNode process is running.
NameNode host CPU utilization | This host-level alert is triggered if CPU utilization of the NameNode exceeds certain thresholds (200% warning, 250% critical). It checks the NameNode JMX Servlet for the SystemCPULoad property. This information is only available if you are running JDK 1.7. | Unusually high CPU utilization: this can be caused by a very unusual job/query workload, but it is generally the sign of an issue in the daemon. | Use the top command to determine which processes are consuming excess CPU. Restart the offending process.
NameNode Web UI | This host-level alert is triggered if the NameNode Web UI is unreachable. | The NameNode process is not running. | Check whether the NameNode process is running.
Percent DataNodes with Available Space | This service-level alert is triggered if storage is full on a certain percentage of DataNodes (10% warn, 30% critical). It aggregates the results from the check_datanode_storage.php plug-in. | Cluster storage is full. If cluster storage is not full, an individual DataNode may be full. | If the cluster still has storage, use the Balancer to distribute the data to relatively less-used DataNodes. If the cluster is full, delete unnecessary data or add storage by adding either more DataNodes or more or larger disks to the DataNodes. After adding more storage, run the Balancer.
Percent DataNodes Available | This alert is triggered if the number of down DataNodes in the cluster exceeds the configured critical threshold. It uses the check_aggregate plug-in to aggregate the results of DataNode process checks. | DataNodes are down, or DataNodes are running but not listening on the correct network port/address. | Check for dead DataNodes in Ambari Web. Check for errors in the DataNode logs (/var/log/hadoop/hdfs) and restart the DataNode hosts/processes. Run the netstat -tulpn command to check whether the DataNode process is bound to the correct network port.
NameNode RPC latency | This host-level alert is triggered if the NameNode operations RPC latency exceeds the configured critical threshold. Typically, an increase in RPC processing time increases the RPC queue length, causing the average queue wait time to increase for NameNode operations. | A job or an application is performing too many NameNode operations. | Review the job or the application for potential bugs causing it to perform too many NameNode operations.
NameNode Last Checkpoint | This alert triggers if the last time the NameNode performed a checkpoint was too long ago, or if the number of uncommitted transactions is beyond a certain threshold. | Too much time has elapsed since the last NameNode checkpoint. The number of uncommitted transactions is beyond the threshold. | Force a NameNode checkpoint. Review the threshold for uncommitted transactions.
Secondary NameNode Process | This host-level alert is triggered if the Secondary NameNode process cannot be confirmed to be up and listening on the network. This alert is not applicable when NameNode HA is configured. | The Secondary NameNode is not running. | Check that the Secondary NameNode process is running.
NameNode Directory Status | This alert checks whether the NameNode NameDirStatus metric reports a failed directory. | One or more of the NameNode directories are reporting as not healthy. | Check the NameNode UI for information about unhealthy directories.
HDFS capacity utilization | This service-level alert is triggered if the HDFS capacity utilization exceeds the configured critical threshold (80% warn, 90% critical). It checks the NameNode JMX Servlet for the CapacityUsed and CapacityRemaining properties. | Cluster storage is full. | Delete unnecessary data. Archive unused data. Add more DataNodes. Add more or larger disks to the DataNodes. After adding more storage, run the Balancer.
DataNode Health Summary | This service-level alert is triggered if there are unhealthy DataNodes. | A DataNode is in an unhealthy state. | Check the NameNode UI for the list of dead DataNodes.
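The fsck-based remedy for the NameNode Blocks health alert can be sketched as the following command sequence. This is a minimal sketch, assuming the commands run on a cluster node with the Hadoop client configured and appropriate HDFS permissions; the paths used are illustrative.

```shell
# 1. Summarize filesystem health; the report includes missing and
#    corrupt block counts (path "/" checks the whole filesystem).
hadoop fsck /

# 2. List only the files that have corrupt blocks.
hadoop fsck / -list-corruptfileblocks

# 3. For a specific file (hypothetical path), show its blocks and the
#    DataNodes holding each replica, to check whether the missing
#    replicas live only on DataNodes that are down.
hadoop fsck /path/to/file -files -blocks -locations

# 4. After recovering the data from backup (or writing it off),
#    delete the corrupt files so the alert can clear.
hadoop fsck / -delete
```

Step 3 is what ties the alert back to its most common cause: if every replica of a block is on a dead DataNode, bringing that node back up restores the block without any deletion.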
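The port-binding and rebalancing remedies that recur in the table can be sketched as follows. The port number and threshold are illustrative; the actual port depends on the cluster's configuration, and the commands must run on a cluster node.

```shell
# Check whether the NameNode (or DataNode) process is bound to its
# configured port; replace 8020 with the port set for your cluster.
netstat -tulpn | grep 8020

# After adding DataNodes or disks, redistribute blocks to the
# less-used nodes. -threshold is the allowed deviation, in percent,
# from average cluster utilization; 10 is an example value.
hadoop balancer -threshold 10
```

The Balancer can be run while the cluster is in use; a lower threshold produces a more even distribution but takes longer to converge.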