6. Configuring New Alerts For Hadoop

The out-of-the-box Nagios alerts displayed in Ambari Web cover a broad range of Hadoop behavior, but often an administrator will want to create additional alerts based on the needs of the individual installation. This section provides a high-level description of the process of adding those alerts so that they can be displayed in Ambari Web.

Step 1: Create a Nagios Plugin Script/Executable: Begin by creating a Nagios plugin that checks for the particular conditions you wish to monitor. Many pre-written plugin scripts are available from the open source Nagios Plugin project and can be customized for your purposes. You can also look at the out-of-the-box plugin scripts that ship with Ambari. The default location for those files on the Nagios server is /usr/lib64/nagios/plugins/. For more information on creating Nagios plugins, see the Nagios Plugin project page at http://nagiosplug.sourceforge.net/developer-guidelines.html.
Step 2: Save Your Plugin to the Plugin Directory on the Nagios Server Machine: The default location is /usr/lib64/nagios/plugins/.
Step 3: Define the Command to Execute the New Plug-In: In /etc/nagios/objects find and open the hadoop-commands.cfg file with a text editor. Add the following information to the list:

define command{
  command_name  my_command_name
  command_line  $USER1$/my_command_name.sh $HOSTADDRESS$ $ARG1$ $ARG2$
}

where:

command_name is the name by which service definitions refer to this command.
command_line is the plugin executable and the arguments used to launch it.

Notice that the command_line in the sample includes standard Nagios variables like $ARG1$ and $HOSTADDRESS$. The variable $USER1$ is the Nagios plugin directory path. Write down the command name and the arguments it expects for later use.
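The plugin launched by this command_line might look like the following minimal sketch. It assumes a hypothetical my_command_name.sh that receives the host address, a warning threshold, and a critical threshold, in the order given by the sample command_line. The measure and evaluate function names and the disk-usage measurement are illustrative placeholders, not part of Ambari. The return codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and the single status line follow the standard Nagios plugin convention.

```shell
#!/bin/bash
# Hypothetical plugin sketch: my_command_name.sh <host> <warn> <crit>

# Standard Nagios plugin return codes.
STATE_OK=0
STATE_WARNING=1
STATE_CRITICAL=2
STATE_UNKNOWN=3

# Placeholder measurement (assumption): percentage of the local root
# filesystem in use. A real plugin would query the host passed in $1.
measure() {
  df -P / | awk 'NR == 2 { gsub("%", "", $5); print $5 }'
}

# Compare an integer value against the thresholds, print one status line,
# and return the matching Nagios state.
evaluate() {
  local host="$1" value="$2" warn="$3" crit="$4" n
  for n in "$value" "$warn" "$crit"; do
    case "$n" in
      ''|*[!0-9]*)
        echo "UNKNOWN - usage: my_command_name.sh <host> <warn> <crit>"
        return "$STATE_UNKNOWN"
        ;;
    esac
  done
  if [ "$value" -ge "$crit" ]; then
    echo "CRITICAL - $host value is $value (critical threshold $crit)"
    return "$STATE_CRITICAL"
  elif [ "$value" -ge "$warn" ]; then
    echo "WARNING - $host value is $value (warning threshold $warn)"
    return "$STATE_WARNING"
  fi
  echo "OK - $host value is $value"
  return "$STATE_OK"
}

# As the last lines of a real plugin, run the check and exit with its state:
#   evaluate "$1" "$(measure)" "$2" "$3"
#   exit $?
```

The plugin file must be executable by the user that runs Nagios, otherwise the check cannot be launched.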

Step 4: Decide Which Hostgroup Your Plugin Should Check: In /etc/nagios/objects find and open the hadoop-hostgroups.cfg file. Write down the hostgroup_name that corresponds to the set of hosts your check should run against.
Step 5: Decide Which Servicegroup Your Plugin Belongs To: In /etc/nagios/objects find and open the hadoop-servicegroups.cfg file. Write down the servicegroup_name that is most applicable, creating your own if necessary. These service groups are helpful in enabling/disabling multiple alerts as a unit using the Nagios Web UI.
Step 6: Define the Alert Entry: In /etc/nagios/objects find and open the hadoop-services.cfg file. Create a service entry like the following and add it to the list:

define service {
  hostgroup_name        nagios-server
  use                   hadoop-service
  service_description   NAGIOS::Nagios status log staleness
  servicegroups         NAGIOS
  check_command         check_nagios!10!/var/nagios/status.dat!/usr/bin/nagios
  normal_check_interval 5
  retry_check_interval  0.5
  max_check_attempts    2
}

where:

hostgroup_name is the name you wrote down in Step 4.
use indicates that this service inherits from hadoop-service. All services inherit from hadoop-service.
service_description is the name of the service/alert.

Follow the convention of using one of the predefined Hadoop service names as a prefix, followed by a double colon and then a short description of the new alert. The service name prefix determines the Service under which the alert appears. The predefined Hadoop service names include NAMENODE, HDFS, JOBTRACKER, MAPREDUCE, HBASEMASTER, HBASE, ZOOKEEPER, HIVE-METASTORE, OOZIE, and TEMPLETON.

servicegroups is the group name you wrote down in Step 5.
check_command is the command_name you defined in the hadoop-commands.cfg file in Step 3, followed by the arguments to pass to it.

Note that in this format, arguments are separated by the “!” character.

normal_check_interval is the number of minutes between regularly scheduled checks of the service while its state remains unchanged.
retry_check_interval is the number of minutes between “retries”.

When a service changes state, Nagios can confirm that state change by retrying the check multiple times. This retry interval can be different from the regular check interval.

max_check_attempts is the maximum number of times the check is attempted before a state change is confirmed.

Usually when the state of a service changes, this change is considered “soft” until multiple retries confirm it. Once the state change is confirmed, it is considered “hard”. This value indicates the number of attempts that must be made to confirm this state as “hard” and thus to display it.
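To illustrate how a "!"-separated check_command relates to the $ARG1$, $ARG2$ variables used in a command_line, here is a small sketch that splits the sample check_command in the same order in which Nagios assigns the arguments. The split_check_command function is a hypothetical helper written for this illustration, not a Nagios tool.

```shell
#!/bin/bash
# Hypothetical helper: show how a check_command value breaks down into
# the command name and the numbered argument variables.
split_check_command() {
  local IFS='!'   # split fields on the "!" separator
  local i=1 arg
  set -f          # disable filename expansion while splitting
  set -- $1       # intentionally unquoted so the value is split on "!"
  set +f
  echo "command_name=$1"
  shift
  for arg in "$@"; do
    echo "ARG$i=$arg"
    i=$((i + 1))
  done
}

split_check_command 'check_nagios!10!/var/nagios/status.dat!/usr/bin/nagios'
# prints:
#   command_name=check_nagios
#   ARG1=10
#   ARG2=/var/nagios/status.dat
#   ARG3=/usr/bin/nagios
```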

Step 7: Restart the Server to See the New Alerts: When you have finished making your edits, restart the Nagios service using the following command as the root user:

service nagios restart
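Because an error in any of the edited files can prevent Nagios from starting, it may be worth validating the configuration before restarting. The sketch below relies on the standard nagios -v configuration check. The restart_nagios_if_valid function name is hypothetical, and the main configuration path /etc/nagios/nagios.cfg is an assumption that may differ on your installation.

```shell
#!/bin/bash
# Hypothetical helper: restart Nagios only if the configuration verifies.
#   $1: nagios binary (default: nagios)
#   $2: main configuration file (assumed default: /etc/nagios/nagios.cfg)
restart_nagios_if_valid() {
  local nagios_bin="${1:-nagios}"
  local main_cfg="${2:-/etc/nagios/nagios.cfg}"
  if "$nagios_bin" -v "$main_cfg"; then
    service nagios restart
  else
    echo "Configuration check failed; Nagios was not restarted." >&2
    return 1
  fi
}

# Run as root:
#   restart_nagios_if_valid
```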
