This section describes installing and testing Nagios, a system that monitors Hadoop cluster components and issues alerts on warning and critical conditions.
On the host you have chosen for the Nagios server, install the RPMs:
For RHEL and CentOS:
yum -y install net-snmp net-snmp-utils php-pecl-json
yum -y install wget httpd php net-snmp-perl perl-Net-SNMP fping nagios nagios-plugins nagios-www
For SLES:
zypper -n --no-gpg-checks install net-snmp
zypper -n --no-gpg-checks install wget apache2 php php-curl perl-SNMP perl-Net-SNMP fping nagios nagios-plugins nagios-www
There are several configuration files that must be set up for Nagios.
From the HDP companion files, open the configuration_files folder and copy the files in the nagios folder to a temporary directory. The nagios folder contains two sub-folders, objects and plugins.
Create the following Nagios directories:
mkdir -p /var/nagios /var/nagios/rw /var/log/nagios /var/log/nagios/spool/checkresults /var/run/nagios
Change ownership on those directories to the Nagios user:
chown -R nagios:nagios /var/nagios /var/nagios/rw /var/log/nagios /var/log/nagios/spool/checkresults /var/run/nagios
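The directory creation step can also be scripted so that it is safe to repeat. The following is a minimal sketch, not part of the HDP companion files: the function name create_nagios_dirs and its optional path-prefix argument are assumptions added here so the script can be tried in a scratch directory. It relies on mkdir -p to create any missing parent directories.

```shell
#!/bin/sh
# Sketch: create the Nagios runtime directories listed in this section.
# create_nagios_dirs and its optional prefix argument are hypothetical;
# omit the prefix on a real Nagios server.
create_nagios_dirs() {
    prefix="${1:-}"
    for dir in /var/nagios /var/nagios/rw /var/log/nagios \
               /var/log/nagios/spool/checkresults /var/run/nagios; do
        # -p creates missing parents (such as /var/log/nagios/spool)
        # and does not fail when the directory already exists.
        mkdir -p "${prefix}${dir}" || return 1
    done
}
```

On the Nagios server, calling create_nagios_dirs with no argument creates the directories under /; the chown command shown above still has to be run afterwards.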
Copy the contents of the objects folder into place:
cp <tmp-directory>/nagios/objects/*.* /etc/nagios/objects/
Copy the contents of the plugins folder into place:
cp <tmp-directory>/nagios/plugins/*.* /usr/lib64/nagios/plugins/
Choose a Nagios administrator password, for example, “admin”.
Set the password with the following command:
htpasswd -c -b /etc/nagios/htpasswd.users nagiosadmin admin
Open /etc/nagios/objects/contacts.cfg with a text editor.
Replace the nagios@localhost value with the administrator's email address so that the administrator receives alerts.
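If you prefer not to edit the file by hand, the same change can be made with sed. This is a sketch under assumptions: set_admin_email is a hypothetical helper name, and the address passed to it must not contain the characters | or &, which have special meaning in the sed expression used here.

```shell
#!/bin/sh
# Sketch: replace the default contact address in a Nagios contacts file.
# set_admin_email is a hypothetical helper, not part of Nagios or HDP.
set_admin_email() {
    file="$1"
    email="$2"
    # Write to a temporary file first so the original survives a failed sed.
    sed "s|nagios@localhost|${email}|g" "$file" > "${file}.new" &&
        mv "${file}.new" "$file"
}
```

For example, set_admin_email /etc/nagios/objects/contacts.cfg ops@example.com, where ops@example.com is a placeholder address.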
Open /etc/nagios/nagios.cfg with a text editor.
In the section OBJECT CONFIGURATION FILE(S), add the following:
# Definitions for hadoop servers
cfg_file=/etc/nagios/objects/hadoop-commands.cfg
cfg_file=/etc/nagios/objects/hadoop-hosts.cfg
cfg_file=/etc/nagios/objects/hadoop-hostgroups.cfg
cfg_file=/etc/nagios/objects/hadoop-services.cfg
cfg_file=/etc/nagios/objects/hadoop-servicegroups.cfg
Change the command_file directive to /var/nagios/rw/nagios.cmd:
command_file=/var/nagios/rw/nagios.cmd
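Both nagios.cfg changes can be applied by a script that is safe to run more than once. The sketch below makes two assumptions: configure_nagios_cfg is a hypothetical helper name, and Nagios accepts cfg_file directives appended at the end of the file rather than placed inside the OBJECT CONFIGURATION FILE(S) section. If you want the lines in that section, edit the file by hand instead.

```shell
#!/bin/sh
# Sketch: register the Hadoop object files and set command_file in nagios.cfg.
# configure_nagios_cfg is a hypothetical helper; pass the path to nagios.cfg.
configure_nagios_cfg() {
    cfg="$1"
    for name in commands hosts hostgroups services servicegroups; do
        line="cfg_file=/etc/nagios/objects/hadoop-${name}.cfg"
        # Append each cfg_file line only if it is not already present.
        grep -qxF "$line" "$cfg" || printf '%s\n' "$line" >> "$cfg"
    done
    # Point the existing command_file directive at the new location.
    sed 's|^command_file=.*|command_file=/var/nagios/rw/nagios.cmd|' "$cfg" > "${cfg}.new" &&
        mv "${cfg}.new" "$cfg"
}
```

For example, configure_nagios_cfg /etc/nagios/nagios.cfg. The sed expression only rewrites a command_file line that already exists; it does not add one.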
Open /etc/nagios/objects/hadoop-hosts.cfg with a text editor.
Create a "define host { … }" entry for each host in your cluster using the following format:
define host {
        alias                     @HOST@
        host_name                 @HOST@
        use                       linux-server
        address                   @HOST@
        check_interval            0.25
        retry_interval            0.25
        max_check_attempts        4
        notifications_enabled     1
        first_notification_delay  0     # Send notification soon after change in the hard state
        notification_interval     0     # Send the notification once
        notification_options      d,u,r
}
Replace "@HOST@" with the hostname.
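For clusters with many hosts, the entries can be generated rather than typed. The following sketch prints one host entry with @HOST@ replaced by the hostname passed in. The function name render_host_entry is hypothetical, and the explanatory comments from the template are left out of the generated text.

```shell
#!/bin/sh
# Sketch: print one "define host" entry with @HOST@ replaced by a hostname.
# render_host_entry is a hypothetical helper name.
render_host_entry() {
    host="$1"
    cat <<EOF
define host {
        alias                     ${host}
        host_name                 ${host}
        use                       linux-server
        address                   ${host}
        check_interval            0.25
        retry_interval            0.25
        max_check_attempts        4
        notifications_enabled     1
        first_notification_delay  0
        notification_interval     0
        notification_options      d,u,r
}

EOF
}
```

For example, running render_host_entry nn1.example.com >> /etc/nagios/objects/hadoop-hosts.cfg appends an entry for the placeholder host nn1.example.com; repeat the call, or loop over a list of hostnames, for each host in the cluster.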
Open /etc/nagios/objects/hadoop-hostgroups.cfg with a text editor.
Create host groups based on all the hosts and services you have installed in your cluster. Each host group entry should follow this format:
define hostgroup {
        hostgroup_name  @NAME@
        alias           @ALIAS@
        members         @MEMBERS@
}
The parameters (such as @NAME@) are defined in the following table.
Table 25.1. Host Group Parameters
Parameter  Description
@NAME@     The host group name
@ALIAS@    The host group alias
@MEMBERS@  A comma-separated list of hosts in the group
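A host group entry can be generated from these three parameters. The sketch below is illustrative only: render_hostgroup is a hypothetical helper that takes the group name, the alias, and then one argument per member host, and joins the members with commas.

```shell
#!/bin/sh
# Sketch: print one "define hostgroup" entry.
# render_hostgroup NAME ALIAS HOST [HOST ...] is a hypothetical helper.
render_hostgroup() {
    name="$1"
    alias_name="$2"
    shift 2
    # Join the remaining arguments (the member hosts) with commas.
    members=$(IFS=,; printf '%s' "$*")
    cat <<EOF
define hostgroup {
        hostgroup_name  ${name}
        alias           ${alias_name}
        members         ${members}
}
EOF
}
```

For example, render_hostgroup slaves slaves dn1.example.com dn2.example.com prints a slaves group whose members line reads dn1.example.com,dn2.example.com (both hostnames are placeholders).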
The following table lists the core and monitoring host groups:
Table 25.2. Core and Monitoring Host Groups
Service                     Component          Name            Alias           Members
All servers in the cluster  n/a                all-servers     All Servers     List all servers in the cluster
HDFS                        NameNode           namenode        namenode        The NameNode host
HDFS                        SecondaryNameNode  snamenode       snamenode       The Secondary NameNode host
MapReduce                   JobTracker         jobtracker      jobtracker      The Job Tracker host
HDFS, MapReduce             Slaves             slaves          slaves          List all hosts running DataNode and TaskTrackers
Nagios                      n/a                nagios-server   nagios-server   The Nagios server host
Ganglia                     n/a                ganglia-server  ganglia-server  The Ganglia server host
The following table lists the ecosystem project host groups:
Table 25.3. Ecosystem Project Host Groups
Service    Component  Name               Alias              Members
HBase      Master     hbasemaster        hbasemaster        List the master server
HBase      Region     region-servers     region-servers     List all region servers
ZooKeeper  n/a        zookeeper-servers  zookeeper-servers  List all ZooKeeper servers
Oozie      n/a        oozie-server       oozie-server       The Oozie server
Hive       n/a        hiveserver         hiveserver         The Hive metastore server
WebHCat    n/a        webhcat-server     webhcat-server     The WebHCat server
Templeton  n/a        templeton-server   templeton-server   The Templeton server
Open /etc/nagios/objects/hadoop-services.cfg with a text editor. This file contains service definitions for the following services: Ganglia, HBase (Master and Region), ZooKeeper, Hive, Templeton, and Oozie.
Remove any service definitions for services you have not installed.
Replace the parameters @NAGIOS_BIN@ and @STATUS_DAT@ based on the operating system.
For RHEL and CentOS:
@STATUS_DAT@ = /var/nagios/status.dat
@NAGIOS_BIN@ = /usr/bin/nagios
For SLES:
@STATUS_DAT@ = /var/lib/nagios/status.dat
@NAGIOS_BIN@ = /usr/sbin/nagios
If you have installed Hive or Oozie services, replace the parameter @JAVA_HOME@ with the path to the Java home. For example, /usr/java/default.
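The placeholder replacements can be done in one pass with sed. This is a sketch under assumptions: substitute_placeholders is a hypothetical helper name, and the values passed in must not contain the | character, which is used as the sed delimiter. Pass the values for your operating system as listed above.

```shell
#!/bin/sh
# Sketch: fill in the @STATUS_DAT@, @NAGIOS_BIN@ and @JAVA_HOME@ placeholders.
# substitute_placeholders FILE STATUS_DAT NAGIOS_BIN JAVA_HOME is hypothetical.
substitute_placeholders() {
    file="$1"
    status_dat="$2"
    nagios_bin="$3"
    java_home="$4"
    # Write to a temporary file first so the original survives a failed sed.
    sed -e "s|@STATUS_DAT@|${status_dat}|g" \
        -e "s|@NAGIOS_BIN@|${nagios_bin}|g" \
        -e "s|@JAVA_HOME@|${java_home}|g" \
        "$file" > "${file}.new" &&
        mv "${file}.new" "$file"
}
```

On RHEL or CentOS, for example: substitute_placeholders /etc/nagios/objects/hadoop-services.cfg /var/nagios/status.dat /usr/bin/nagios /usr/java/default.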
Open /etc/nagios/objects/hadoop-commands.cfg with a text editor.
Replace the @STATUS_DAT@ parameter with the location of the Nagios status file. The file location depends on your operating system.
For RHEL and CentOS:
/var/nagios/status.dat
For SLES:
/var/lib/nagios/status.dat
Open /etc/nagios/objects/hadoop-commands.cfg with a text editor. Add the following commands:
define command{
        command_name    check_templeton_status
        command_line    $USER1$/check_wrapper.sh $USER1$/check_templeton_status.sh $HOSTADDRESS$ $ARG1$ $ARG2$ $ARG3$ $ARG4$ $ARG5$ $ARG6$ $ARG7$
}

define command{
        command_name    check_tcp_wrapper
        command_line    $USER1$/check_wrapper.sh $USER1$/check_tcp -H $HOSTADDRESS$ -p $ARG1$ $ARG2$
}
Follow these steps to validate your installation.
Validate the Nagios installation:
nagios -v /etc/nagios/nagios.cfg
Start the Nagios server and httpd:
/etc/init.d/nagios start
/etc/init.d/httpd start
Confirm that the Nagios server is running:
/etc/init.d/nagios status
This should return:
nagios (pid #) is running...
To test Nagios Services, run the following command:
/usr/lib64/nagios/plugins/check_hdfs_capacity.php -h namenode_hostname -p 50070 -w 80% -c 90%
This should return:
OK: DFSUsedGB:<some#>, DFSTotalGB:<some#>
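Besides the text it prints, a Nagios plugin reports its result through its exit status. The sketch below maps an exit status to a state name using the standard Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). It assumes that check_hdfs_capacity.php follows that convention; nagios_state_name is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: translate a plugin exit status into its Nagios state name.
# nagios_state_name is a hypothetical helper, not part of Nagios.
nagios_state_name() {
    case "$1" in
        0) echo OK ;;
        1) echo WARNING ;;
        2) echo CRITICAL ;;
        3) echo UNKNOWN ;;
        *) echo INVALID ;;   # outside the 0-3 range used by the convention
    esac
}
```

To use it, run the plugin command shown above and then run nagios_state_name $? immediately afterwards.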
To test Nagios Access, browse to the Nagios server.
http://<nagios.server>/nagios
Log in using the Nagios admin username (nagiosadmin) and password (see Set the Nagios Admin Password). Click on hosts to check that all hosts in the cluster are listed. Click on services to check that all of the Hadoop services are listed for each host.
Test Nagios alerts.
Log in to one of your cluster DataNodes.
Stop the TaskTracker service:
su -l mapred -c "/usr/hdp/current/hadoop/bin/hadoop-daemon.sh --config /etc/hadoop/conf stop tasktracker"
Verify that you received an alert at the admin email address and that the console shows a critical state.
Start the TaskTracker service:
su -l mapred -c "/usr/hdp/current/hadoop/bin/hadoop-daemon.sh --config /etc/hadoop/conf start tasktracker"
Verify that you received an alert at the admin email address and that the critical state is cleared on the console.