This alert is triggered if the number of down NodeManagers in the cluster is greater
than the configured critical threshold. It uses the check_aggregate
plug-in to aggregate the results of NodeManager process alert checks.
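The aggregation logic can be sketched as follows. This is a minimal illustration, not the check_aggregate plug-in's actual code. The function name `aggregate_status` and the count-based warning and critical thresholds are hypothetical. The sketch assumes that a NodeManager counts as "down" when its per-host process check reports CRITICAL. It returns standard Nagios plug-in exit codes.

```python
# Standard Nagios plug-in exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def aggregate_status(process_states, warn_threshold, crit_threshold):
    """Hypothetical aggregation of per-host NodeManager process check states.

    process_states: list of state strings such as "OK" or "CRITICAL",
    one per NodeManager host.
    warn_threshold / crit_threshold: assumed to be counts of down
    NodeManagers; the real plug-in's threshold semantics may differ.
    Returns (exit_code, message).
    """
    if not process_states:
        return UNKNOWN, "UNKNOWN: no NodeManager process checks found"

    total = len(process_states)
    # Assumption: a CRITICAL process check means that NodeManager is down.
    down = sum(1 for state in process_states if state == "CRITICAL")

    if down > crit_threshold:
        code = CRITICAL
    elif down > warn_threshold:
        code = WARNING
    else:
        code = OK
    return code, f"{LABELS[code]}: {down}/{total} NodeManagers down"
```

For example, `aggregate_status(["OK", "CRITICAL", "CRITICAL"], 0, 1)` reports CRITICAL, because two down NodeManagers exceed a critical threshold of one.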
NodeManagers are down.
NodeManagers are not down but are not listening on the correct network port/address.
The Nagios server cannot connect to one or more NodeManagers.
Check for dead NodeManagers.
Check for any errors in the NodeManager logs (/var/log/hadoop/yarn) and restart the NodeManager hosts/processes, as necessary.
Run the netstat -tuplpn command to check if the NodeManager process is bound to the correct network port.
Use ping to check the network connection between the Nagios server and the NodeManager hosts.
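A ping only confirms that the host answers ICMP. The sketch below uses a different technique: it checks from the Nagios server's side whether a NodeManager host accepts TCP connections on a given port. This covers both the reachability question and the "listening on the wrong port" cause. The function name `port_reachable` is hypothetical. The default port 8042 is YARN's usual NodeManager web UI port (`yarn.nodemanager.webapp.address`). It is an assumption that this is the port your alert checks, so confirm it against your yarn-site.xml.

```python
import socket

# Assumption: 8042 is the usual default NodeManager web UI port
# (yarn.nodemanager.webapp.address); your cluster may use another port.
DEFAULT_NM_PORT = 8042


def port_reachable(host, port=DEFAULT_NM_PORT, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout.

    A refused connection, a timeout, or a DNS failure all raise OSError
    (socket.timeout is a subclass of OSError) and are reported as False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

If `port_reachable("nm-host.example.com")` returns False while ping succeeds, the host is up but nothing is accepting connections on that port. In that case, check the NodeManager process and its bound port on the host itself.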