5.3.2. Percent NodeManagers live

This alert is triggered if the number of down NodeManagers in the cluster is greater than the configured critical threshold. It uses the check_aggregate plug-in to aggregate the results of the NodeManager process alert checks.

Potential causes
  • NodeManagers are down.

  • NodeManagers are not down but are not listening on the correct network port/address.

  • The Nagios server cannot connect to one or more NodeManagers.

Possible remedies
  • Check for dead NodeManagers.

  • Check for any errors in the NodeManager logs (/var/log/hadoop/yarn) and restart the NodeManager hosts/processes, as necessary.

  • Run the netstat -tuplpn command to check whether the NodeManager process is bound to the correct network port.

  • Use ping to check the network connection between the Nagios server and the NodeManager hosts.
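
The port and connectivity remedies above can also be checked from the Nagios server in one pass. The following is a minimal sketch, not part of the Nagios tooling: the helper names port_is_open and find_unreachable are hypothetical, and the host names and port are supplied by the caller. It assumes that the port to test is the one configured in yarn.nodemanager.address in yarn-site.xml for your cluster.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    A False result covers all failure cases alike: host unreachable,
    name not resolvable, connection refused, or timed out.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # socket.timeout, socket.gaierror and ConnectionRefusedError
        # are all subclasses of OSError.
        return False


def find_unreachable(hosts, port, timeout=3.0):
    """Return the hosts (in input order) that do not accept a connection."""
    return [h for h in hosts if not port_is_open(h, port, timeout)]
```

Calling find_unreachable with the list of NodeManager hosts and the configured port returns the hosts to inspect further. A host in that list may be down, may have no NodeManager process bound to the port, or may be cut off from the Nagios server, so the log, netstat, and ping checks above are still needed to tell these cases apart.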

