The out-of-the-box Nagios alerts displayed in Ambari Web cover a broad range of Hadoop behavior, but often an administrator will want to create additional alerts based on the needs of the individual installation. This section provides a high-level description of the process of adding those alerts so that they can be displayed in Ambari Web.
- Step 1: Create a Nagios Plugin Script/Executable
Begin by creating a Nagios plugin that checks for the particular conditions you wish to monitor. Many pre-written plugin scripts are available from the open source Nagios Plugin project and can be customized for your specific purposes. You can also look at the out-of-the-box plugin scripts that ship with Ambari. The default location for those files on the Nagios server is /usr/lib64/nagios/plugins/. For more information on creating Nagios plugins, see the Nagios Plugin project page at http://nagiosplug.sourceforge.net/developer-guidelines.html.
- Step 2: Save Your Plugin to the Plugin Directory on the Nagios Server Machine
The default location is /usr/lib64/nagios/plugins/.
- Step 3: Define the Command to Execute the New Plug-In
In /etc/nagios/objects, find and open the hadoop-commands.cfg file with a text editor. Add the following definition to the list:
define command{
        command_name    my_command_name
        command_line    $USER1$/my_command_name.sh $HOSTADDRESS$ $ARG1$ $ARG2$
}
where:
command_name is the command name.
command_line is the command with arguments used to launch the command.
Notice that the command_line in the sample includes standard Nagios variables like $ARG1$ and $HOSTADDRESS$. The variable $USER1$ is the Nagios plugin directory path. Write down the full command with arguments for later use.
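The command defined above launches my_command_name.sh with the host address and two arguments. The following is a minimal sketch of what such a plugin could look like, not a plugin that ships with Ambari. It assumes that $ARG1$ and $ARG2$ are warning and critical thresholds in milliseconds, and the HTTP probe (`probe_ms`) is a hypothetical placeholder for whatever condition you actually want to check. What the sketch does rely on is the standard plugin contract: print one status line and return exit code 0 (OK), 1 (WARNING), 2 (CRITICAL), or 3 (UNKNOWN).

```shell
#!/bin/bash
# my_command_name.sh <host> <warn_ms> <crit_ms>
# Hypothetical example plugin: reports how long an HTTP request to the host takes.

# Standard Nagios plugin exit codes.
STATE_OK=0
STATE_WARNING=1
STATE_CRITICAL=2
STATE_UNKNOWN=3

# Compare a measured value against the thresholds, print the one-line
# status message that Nagios displays, and return the matching exit code.
evaluate() {
    local value=$1 warn=$2 crit=$3
    if [ "$value" -ge "$crit" ]; then
        echo "CRITICAL - response time ${value}ms (critical at ${crit}ms)"
        return $STATE_CRITICAL
    elif [ "$value" -ge "$warn" ]; then
        echo "WARNING - response time ${value}ms (warning at ${warn}ms)"
        return $STATE_WARNING
    fi
    echo "OK - response time ${value}ms"
    return $STATE_OK
}

# Placeholder probe (an assumption for illustration): time an HTTP GET of
# http://<host>/ in whole milliseconds. Replace it with the real check.
probe_ms() {
    local host=$1 seconds
    seconds=$(curl -s -o /dev/null --max-time 10 -w '%{time_total}' "http://${host}/") || return 1
    awk -v t="$seconds" 'BEGIN { printf "%d\n", t * 1000 }'
}

main() {
    if [ "$#" -ne 3 ]; then
        echo "UNKNOWN - usage: $0 <host> <warn_ms> <crit_ms>"
        return $STATE_UNKNOWN
    fi
    local value
    if ! value=$(probe_ms "$1"); then
        echo "CRITICAL - no response from $1"
        return $STATE_CRITICAL
    fi
    evaluate "$value" "$2" "$3"
}

# Run only when executed as a script. The script's exit status is then
# main's return code, which is what Nagios reads.
if [ "${BASH_SOURCE[0]}" = "$0" ]; then
    main "$@"
fi
```

Before wiring the plugin into Nagios, you can run it by hand from the plugin directory with a host name and two thresholds, then inspect the printed line and the exit code (`echo $?`).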
- Step 4: Decide Which Hostgroup Your Plugin Should Check
In /etc/nagios/objects, find and open the hadoop-hostgroups.cfg file. Write down the hostgroup_name that corresponds to the set of hosts your check should run against.
- Step 5: Decide Which Servicegroup Your Plugin Belongs To
In /etc/nagios/objects, find and open the hadoop-servicegroups.cfg file. Write down the servicegroup_name that is most applicable, creating your own if necessary. These service groups are helpful for enabling or disabling multiple alerts as a unit using the Nagios Web UI.
- Step 6: Define the Alert Entry
In /etc/nagios/objects, find and open the hadoop-services.cfg file. Create a service entry like the following and add it to the list:
define service {
        hostgroup_name          nagios-server
        use                     hadoop-service
        service_description     NAGIOS::Nagios status log staleness
        servicegroups           NAGIOS
        check_command           check_nagios!10!/var/nagios/status.dat!/usr/bin/nagios
        normal_check_interval   5
        retry_check_interval    0.5
        max_check_attempts      2
}
where:
hostgroup_name is the name you wrote down in Step 4.
use indicates that this service inherits from hadoop-service. All services inherit from hadoop-service.
service_description is the name of the service/alert.
Follow the convention of using one of the predefined Hadoop service names as a prefix, followed by a double colon and then a short description of the new alert. The service name prefix determines under which Service the alert appears. The predefined Hadoop service names include NAMENODE, HDFS, JOBTRACKER, MAPREDUCE, HBASEMASTER, HBASE, ZOOKEEPER, HIVE-METASTORE, OOZIE, and TEMPLETON.
servicegroups is the group name you wrote down in Step 5.
check_command is the command_name you defined in the hadoop-commands.cfg file in Step 3, followed by the arguments to pass to it.
Note that in this format, arguments are separated by the “!” character.
normal_check_interval is the number of minutes between regularly scheduled checks on the host as long as the check does not change the state.
retry_check_interval is the number of minutes between “retries”.
When a service changes state, Nagios can confirm that state change by retrying the check multiple times. This retry interval can be different from the original check interval.
max_check_attempts is the maximum number of retry attempts.
Usually when the state of a service changes, this change is considered “soft” until multiple retries confirm it. Once the state change is confirmed, it is considered “hard”. This value indicates the number of attempts that must be made to confirm this state as “hard” and thus to display it.
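To make the “!” convention in check_command concrete, the sketch below splits a check_command value into the command name and the fields that Nagios passes to the command definition as $ARG1$, $ARG2$, and so on. The function name `show_check_command` is hypothetical and exists only for illustration; it performs a simple split on “!” and is not part of Nagios.

```shell
#!/bin/bash
# Split a check_command value into the command name and its arguments,
# mirroring how the "!"-separated fields map to $ARG1$, $ARG2$, ...
show_check_command() {
    local IFS='!' fields i
    read -r -a fields <<< "$1"
    echo "command_name=${fields[0]}"
    for (( i = 1; i < ${#fields[@]}; i++ )); do
        echo "\$ARG${i}\$=${fields[i]}"
    done
}

# The check_command from the sample service entry.
show_check_command 'check_nagios!10!/var/nagios/status.dat!/usr/bin/nagios'
```

For the sample entry, the first field is the command name (check_nagios) and the remaining three fields become $ARG1$ through $ARG3$ in that command's command_line.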
- Step 7: Restart the Server to See the New Alerts
When you have finished making your edits, restart the Nagios service by running the following command as the root user:
service nagios restart
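Because a syntax error in any of the edited files can prevent Nagios from starting, it can help to validate the configuration before restarting. The sketch below wraps the restart in a check that uses the `nagios -v` verification mode. The configuration path /etc/nagios/nagios.cfg is an assumption about your installation, and the function name `restart_nagios_if_valid` is hypothetical.

```shell
#!/bin/bash
# Assumed location of the main Nagios configuration file; adjust if needed.
NAGIOS_CFG=${NAGIOS_CFG:-/etc/nagios/nagios.cfg}

# Restart Nagios only if the configuration passes the built-in verification.
restart_nagios_if_valid() {
    if nagios -v "$NAGIOS_CFG"; then
        service nagios restart
    else
        echo "Configuration check failed; Nagios was not restarted." >&2
        return 1
    fi
}
```

Run `restart_nagios_if_valid` as the root user after loading the function into your shell. If verification fails, the output of `nagios -v` names the file and line to fix.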