5.3.4. Percent DataNodes down alert

This alert is triggered if the percentage of DataNodes that are down in the cluster is greater than the configured critical threshold. It uses the check_aggregate plugin to aggregate the results of the DataNode process down alert checks.
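The aggregation described above can be illustrated with a minimal Python sketch. This is not the check_aggregate plugin itself; the function names, the host names, and the warning and critical thresholds (10 and 30 percent) are assumptions chosen only for illustration.

```python
def percent_down(check_results):
    """Return the percentage of hosts whose process check reported down.

    check_results maps a host name to True (up) or False (down).
    """
    if not check_results:
        return 0.0
    down = sum(1 for is_up in check_results.values() if not is_up)
    return 100.0 * down / len(check_results)


def aggregate_status(check_results, warn_pct=10.0, crit_pct=30.0):
    """Map the percentage of down hosts to a Nagios-style status string.

    The default thresholds are hypothetical, not the shipped values.
    """
    pct = percent_down(check_results)
    if pct > crit_pct:
        return "CRITICAL"
    if pct > warn_pct:
        return "WARNING"
    return "OK"


# Hypothetical per-host results from the DataNode process checks.
results = {"dn1": True, "dn2": False, "dn3": True, "dn4": False}
print(percent_down(results))      # 50.0
print(aggregate_status(results))  # CRITICAL
```

With two of four DataNodes down, the sketch reports 50 percent, which exceeds the assumed critical threshold.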

5.3.4.1. Potential causes

The DataNodes are down
The DataNodes are not down but are not listening on the correct network port/address
The Nagios server cannot connect to one or more DataNodes

5.3.4.2. Possible remedies

Check for dead DataNodes in the Services list.
Check for any errors in the DataNode logs (/var/log/hadoop/hdfs) and restart the DataNode hosts/processes.
Run the netstat -tuplpn command to check whether the DataNode process is bound to the correct network port.
Use ping to check the network connection between the Nagios server and the DataNodes.
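
Ping confirms only that a host is reachable, not that the DataNode port accepts connections. As a complement to the checks above, the following Python sketch attempts a TCP connection from the Nagios server to a DataNode. The function name can_connect is hypothetical, and the host and port to test are assumptions that must be taken from your own cluster configuration.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host name and performs the connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For example, can_connect("dn1.example.com", 50010) would test one DataNode; both the host name and the port number here are placeholders. A False result for a host that answers ping suggests that the process is down, is bound to a different port or address, or is blocked by a firewall.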
