5.4.3. Percent TaskTrackers down alert

This alert is triggered when the percentage of TaskTracker hosts that become inaccessible within a short time window crosses the configured critical threshold. It uses the check_aggregate plugin to aggregate the results of the individual TaskTracker process down alert checks.

Potential causes
  • Connectivity issues, such as general network problems or top-of-rack switch failures

Possible remedies
  • Check the JobTracker UI for the list of TaskTrackers. If many of the down TaskTrackers are concentrated on a small set of racks, check for network connectivity issues between the racks

  • Check for errors in the TaskTracker logs on the individual machines (see the TaskTracker process down alert section for more information)

  • Fix the hardware/network issues and restart the TaskTrackers
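
The aggregation behind this alert, and the rack-level check suggested in the remedies, can be illustrated with a minimal Python sketch. This is not the check_aggregate plugin's code. The function names, the state codes (which follow the common Nagios convention), the threshold values, and the use of ">=" for the comparison are all assumptions made for illustration.

```python
from collections import Counter

# Assumed state codes, following the common Nagios plugin convention.
OK, WARNING, CRITICAL = 0, 1, 2


def percent_down(statuses):
    """Return the percentage of hosts whose per-host process check is not OK.

    `statuses` maps a host name to the state of its TaskTracker process check.
    """
    if not statuses:
        return 0.0
    down = sum(1 for state in statuses.values() if state != OK)
    return 100.0 * down / len(statuses)


def aggregate_state(statuses, warn_pct, crit_pct):
    """Map the percentage of down hosts to a single aggregate state.

    Assumption: a threshold is considered crossed when the percentage is
    greater than or equal to it; the real plugin may compare differently.
    """
    pct = percent_down(statuses)
    if pct >= crit_pct:
        return CRITICAL
    if pct >= warn_pct:
        return WARNING
    return OK


def down_by_rack(statuses, rack_of):
    """Count down hosts per rack to reveal rack-level connectivity problems."""
    return Counter(rack_of[host] for host, state in statuses.items() if state != OK)


if __name__ == "__main__":
    # Hypothetical cluster: five TaskTracker hosts on three racks.
    statuses = {"tt1": OK, "tt2": CRITICAL, "tt3": CRITICAL, "tt4": OK, "tt5": OK}
    rack_of = {"tt1": "/rack1", "tt2": "/rack2", "tt3": "/rack2",
               "tt4": "/rack1", "tt5": "/rack3"}

    print("percent down:", percent_down(statuses))
    print("aggregate state:", aggregate_state(statuses, warn_pct=10, crit_pct=30))
    print("down hosts per rack:", dict(down_by_rack(statuses, rack_of)))
```

If most of the down hosts fall on one or two racks, as in the hypothetical data above, that points to a rack-level network problem rather than to independent TaskTracker failures.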

