This alert is triggered if the number of down DataNodes in the cluster is greater than
the configured critical threshold. It uses the check_aggregate plugin to aggregate the
results of Data node process down alert checks.
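The aggregation idea can be illustrated with a minimal sketch. This is not the check_aggregate plugin itself: the function name, the use of percentages, and the warning threshold are assumptions made for illustration only.

```python
from typing import Iterable


def aggregate_status(node_is_down: Iterable[bool],
                     warn_pct: float,
                     crit_pct: float) -> str:
    """Combine per-DataNode check results into one alert state.

    node_is_down: one boolean per DataNode, True if its process check failed.
    warn_pct / crit_pct: hypothetical thresholds, as percentages of all nodes.
    """
    results = list(node_is_down)
    if not results:
        return "UNKNOWN"  # nothing to aggregate

    down_pct = 100.0 * sum(results) / len(results)

    # The alert fires only when the share of down nodes exceeds a threshold.
    if down_pct > crit_pct:
        return "CRITICAL"
    if down_pct > warn_pct:
        return "WARNING"
    return "OK"


# Example with assumed thresholds: 4 of 10 DataNodes down, 10% / 30%.
print(aggregate_status([True] * 4 + [False] * 6, warn_pct=10, crit_pct=30))
```

With these assumed thresholds, 40% of the nodes are down, which is above the critical value, so the sketch reports CRITICAL.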
The DataNodes are down
The DataNodes are not down but are not listening on the correct network port/address
The Nagios server cannot connect to one or more DataNodes
Check for dead DataNodes in the Services list.
Check for any errors in the DataNode logs (/var/log/hadoop/hdfs) and restart the DataNode hosts/processes.
Run the netstat -tuplpn command to check if the DataNode process is bound to the correct network port.
Use ping to check the network connection between the Nagios server and the DataNodes.
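As a complement to ping, which only confirms that a host answers ICMP, a TCP connection attempt from the Nagios server shows whether the DataNode port itself is reachable. The following is a rough sketch: the helper names are hypothetical, and the host names and port in the usage comment are placeholders to be replaced with the values configured for your cluster.

```python
import socket
from typing import Iterable, List


def port_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def unreachable_hosts(hosts: Iterable[str], port: int,
                      timeout: float = 3.0) -> List[str]:
    """Return the hosts whose port could not be reached."""
    return [h for h in hosts if not port_reachable(h, port, timeout)]


# Usage (placeholder host names and port; substitute your cluster's values):
#   down = unreachable_hosts(["datanode1.example.com",
#                             "datanode2.example.com"], 50010)
#   print("Unreachable DataNodes:", down)
```

A host that answers ping but appears in the returned list points to the second cause above (process up but not bound to the expected port/address) or to a firewall between the two machines.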