This alert is triggered if the number of down DataNodes in the cluster is greater than
the configured critical threshold. It uses the check_aggregate plugin to aggregate the
results of the Data node process down alert checks.
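The aggregation rule described above can be sketched in a few lines of Python. This is only an illustration of the comparison against the critical threshold, not the check_aggregate plugin itself; the function name, the input shape (a mapping from host name to an up/down flag), and the returned status strings are assumptions made for the example.

```python
def aggregate_datanode_alert(check_results, critical_threshold):
    """Combine per-host DataNode process checks into one alert status.

    check_results: dict mapping host name -> True if the DataNode
        process check passed, False if it reported the process down
        (hypothetical input shape, not the plugin's real format).
    critical_threshold: number of down DataNodes that must be
        exceeded before the aggregate alert fires.
    """
    # Collect the hosts whose individual check reported "down".
    down_hosts = sorted(host for host, is_up in check_results.items() if not is_up)

    # The alert fires only when the count is greater than the threshold.
    if len(down_hosts) > critical_threshold:
        return "CRITICAL", down_hosts
    return "OK", down_hosts
```

With three hosts of which two are down, a threshold of 1 would yield a critical result, while a threshold of 2 would not, because the count must be strictly greater than the threshold.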
DataNodes are down
DataNodes are not down but are not listening on the correct network port/address
Nagios server cannot connect to one or more DataNodes
Check for dead DataNodes in the Services list.
Check for any errors in the DataNode logs (/var/log/hadoop/hdfs) and restart the DataNode hosts/processes.
Run the netstat -tuplpn command to check if the DataNode process is bound to the correct network port.
Use ping to check the network connection between the Nagios server and the DataNodes.
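As a complement to ping, a TCP connection test run from the Nagios server can show whether each DataNode is actually accepting connections on its port. The following is a minimal sketch that uses only the Python standard library; the helper names, host names, and port number are assumptions for illustration, so substitute the DataNode port configured in your cluster.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def unreachable_datanodes(hosts, port, timeout=3.0, probe=can_connect):
    """Return the hosts that did not accept a connection on the given port.

    probe is injectable so the logic can be exercised without a network.
    """
    return [host for host in hosts if not probe(host, port, timeout)]
```

For example, `unreachable_datanodes(["datanode1.example.com", "datanode2.example.com"], 50010)` would list the hosts that could not be reached; both host names and the port 50010 are placeholders. A host that answers ping but appears in this list points to a process or port-binding problem rather than a network outage.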