The out-of-the-box Nagios alerts displayed in Ambari Web cover a broad range of Hadoop behavior, but often you want to create additional alerts based on the needs of the individual installation. This section provides a high-level description of the process of adding those alerts so that they can be displayed in Ambari Web.
- Step 1: Create a Nagios Plugin Script/Executable
You must begin by creating a Nagios plugin that can check for the particular conditions that you wish to monitor. There are many pre-written plugin scripts available at the Open Source Nagios Plugin project that can be customized for your specific purposes. You can also look at the out-of-the-box plugin scripts that ship with Ambari. The default location for those files on the Nagios server is /usr/lib64/nagios/plugins/. For more information on creating Nagios plugins, see the Nagios Plugin project page at http://nagiosplug.sourceforge.net/developer-guidelines.html.
- Step 2: Save Your Plugin to the Plugin Directory on the Nagios Server Machine
The default location is /usr/lib64/nagios/plugins/.
- Step 3: Define the Command to Execute the New Plug-In
In /etc/nagios/objects, find and open the hadoop-commands.cfg file with a text editor. Add the following information to the list:

    define command{
            command_name    my_command_name
            command_line    $USER1$/my_command_name.sh $HOSTADDRESS$ $ARG1$ $ARG2$
    }

where:
Table 3.1. Define Command
Variable Name | Variable Definition |
---|---|
command_name | The command name |
command_line | The command with arguments used to launch the command |
Notice that the command_line in the sample includes standard Nagios variables like $ARG1$ and $HOSTADDRESS$. The variable $USER1$ is the Nagios plugin directory path. Write the full command with arguments down for later use.
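The command_line above hands the host address and two arguments to my_command_name.sh. As a rough illustration of what such a script might look like, the sketch below follows the standard Nagios plugin convention of printing one status line and exiting with 0 (OK), 1 (WARNING), 2 (CRITICAL), or 3 (UNKNOWN). The function names (probe_value, evaluate, main), the meaning given to the two arguments (warning and critical percentages), and the probe itself (usage of the local root filesystem) are assumptions made for illustration; they are not taken from the plugins that ship with Ambari.

```shell
#!/bin/bash
# Sketch of a hypothetical my_command_name.sh; not a plugin shipped with Ambari.
# Usage: my_command_name.sh HOST WARN_PERCENT CRIT_PERCENT

# Standard Nagios plugin exit codes.
STATE_OK=0
STATE_WARNING=1
STATE_CRITICAL=2
STATE_UNKNOWN=3

# Stand-in probe (assumption): percentage used on the local root filesystem.
# A real plugin would query the host passed as the first argument instead.
probe_value() {
    df -P / | awk 'NR == 2 { sub("%", "", $5); print $5 }'
}

# Print one status line and return the matching Nagios state.
evaluate() {
    local host=$1 value=$2 warn=$3 crit=$4 n
    # Reject empty or non-numeric input before comparing.
    for n in "$value" "$warn" "$crit"; do
        case "$n" in
            ''|*[!0-9]*)
                echo "UNKNOWN - non-numeric value or threshold for $host"
                return "$STATE_UNKNOWN"
                ;;
        esac
    done
    if [ "$value" -ge "$crit" ]; then
        echo "CRITICAL - $host value $value (warn $warn, crit $crit)"
        return "$STATE_CRITICAL"
    elif [ "$value" -ge "$warn" ]; then
        echo "WARNING - $host value $value (warn $warn, crit $crit)"
        return "$STATE_WARNING"
    fi
    echo "OK - $host value $value (warn $warn, crit $crit)"
    return "$STATE_OK"
}

main() {
    if [ "$#" -ne 3 ]; then
        echo "UNKNOWN - usage: my_command_name.sh HOST WARN CRIT"
        return "$STATE_UNKNOWN"
    fi
    evaluate "$1" "$(probe_value)" "$2" "$3"
}

# Run only when arguments are supplied, so the file can also be sourced for testing.
if [ "$#" -gt 0 ]; then
    main "$@"
    exit $?
fi
```

Keeping the threshold logic in evaluate, separate from the probe, makes it possible to exercise the status mapping by hand (for example, evaluate host1 15 10 20) before Nagios ever runs the script.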
- Step 4: Decide Which Hostgroup Your Plugin Should Check
In /etc/nagios/objects, find and open the hadoop-hostgroups.cfg file. Write down the hostgroup_name that corresponds to the set of hosts your check should run against.
- Step 5: Decide Which Servicegroup Your Plugin Belongs To
In /etc/nagios/objects, find and open the hadoop-servicegroups.cfg file. Write down the servicegroup_name that is most applicable, creating your own if necessary. These service groups are helpful in enabling/disabling multiple alerts as a unit using the Nagios Web UI.
- Step 6: Define the Alert Entry
In /etc/nagios/objects, find and open the hadoop-services.cfg file. Create a service entry like the following and add it to the list:

    define service {
            hostgroup_name          nagios-server
            use                     hadoop-service
            service_description     NAGIOS::Nagios status log staleness
            servicegroups           NAGIOS
            check_command           check_nagios!10!/var/nagios/status.dat!/usr/bin/nagios
            normal_check_interval   5
            retry_check_interval    0.5
            max_check_attempts      2
    }

where:
Table 3.2. Define Service
Variable Name | Variable Definition |
---|---|
hostgroup_name | The hostgroup name you wrote down in Step 4 |
use | Indicates that this service inherits from hadoop-service. All services inherit from hadoop-service. |
service_description | The name of the service/alert[a] |
servicegroups | The group name you wrote down in Step 5 |
check_command | The command_name you defined in the hadoop-commands.cfg file in Step 3, followed by its arguments[b] |
normal_check_interval | The number of minutes between regularly scheduled checks on the host as long as the check does not change the state |
retry_check_interval | The number of minutes between “retries”[c] |
max_check_attempts | The maximum number of retry attempts[d] |

[a] Follow the convention of using one of the predefined Hadoop service names as a prefix, followed by a double colon and then a short description of the new alert. The service name prefix is used to determine under which Service the alert appears. The list of predefined Hadoop service names includes NAMENODE, HDFS, JOBTRACKER, MAPREDUCE, HBASEMASTER, HBASE, ZOOKEEPER, HIVE-METASTORE, OOZIE, and TEMPLETON.

[b] In this format, arguments are separated by the “!” character.

[c] When a service changes state, Nagios can confirm that state change by retrying the check multiple times. This retry interval can be different from the original check interval.

[d] Usually when the state of a service changes, this change is considered “soft” until multiple retries confirm it. After the state change is confirmed, it is considered “hard”. This value indicates the number of attempts that must be made to confirm this state as “hard” and thus to display it.
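
Footnote [b] can be made concrete with a small sketch. Nagios splits the check_command value at each "!": the first field selects the command definition, and the remaining fields become $ARG1$, $ARG2$, and so on in that definition's command_line. The helper name show_check_command_args is hypothetical and only mimics that splitting for illustration; it is not part of Nagios.

```shell
#!/bin/bash
# Illustration only: mimic how a check_command value maps onto $ARGn$ macros.

show_check_command_args() {
    local spec=$1 index=1 rest
    # Everything before the first "!" is the command_name.
    printf 'command_name = %s\n' "${spec%%!*}"
    # A value without "!" carries no arguments.
    case "$spec" in
        *'!'*) ;;
        *) return 0 ;;
    esac
    rest=${spec#*!}
    while :; do
        # Print the next field, then drop it from the remainder.
        printf '$ARG%s$ = %s\n' "$index" "${rest%%!*}"
        case "$rest" in
            *'!'*) rest=${rest#*!} ;;
            *) break ;;
        esac
        index=$((index + 1))
    done
}
```

For the sample service entry, calling show_check_command_args 'check_nagios!10!/var/nagios/status.dat!/usr/bin/nagios' reports check_nagios as the command name, 10 as $ARG1$, /var/nagios/status.dat as $ARG2$, and /usr/bin/nagios as $ARG3$.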
- Step 7: Restart the Server to See the New Alerts
When you have finished making your edits, restart the Nagios service by running the following command as the root user:

    service nagios restart
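
Because a syntax error in any of the edited files can prevent Nagios from starting, it may help to validate the configuration before restarting. The sketch below wraps the restart in a check that uses the nagios -v verification mode. The function name validate_and_restart is hypothetical, and the main configuration path /etc/nagios/nagios.cfg is an assumption; adjust it to match your installation.

```shell
#!/bin/bash
# Sketch: restart Nagios only if its configuration passes verification.
# Assumption: the main configuration file is /etc/nagios/nagios.cfg.

validate_and_restart() {
    local cfg=${1:-/etc/nagios/nagios.cfg}
    # "nagios -v" parses the configuration and exits non-zero on errors.
    if nagios -v "$cfg" >/dev/null; then
        service nagios restart
    else
        echo "Configuration check failed for $cfg; Nagios was not restarted." >&2
        return 1
    fi
}
```

Run validate_and_restart as the root user. If verification fails, running nagios -v on the same file directly shows the detailed error messages.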