Operating System Information
The following commands report the Linux kernel version and the distribution release. Use this information to determine whether HDP is running on a supported platform.
Command:
uname -a
Example:
$ uname -a
Linux test63.localdomain 2.6.32-279.el6.x86_64 #1 SMP Fri Jun 22 12:19:21 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
Command:
cat /proc/version
Example:
$ cat /proc/version
Linux version 2.6.32-279.el6.x86_64 (mockbuild@c6b9.bsys.dev.centos.org) (gcc version 4.4.6 20120305 (Red Hat 4.4.6-4) (GCC) ) #1 SMP Fri Jun 22 12:19:21 UTC 2012
Command:
cat /etc/*-release
Example:
$ cat /etc/*-release
CentOS release 6.3 (Final)
CentOS release 6.3 (Final)
CentOS release 6.3 (Final)
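When checking many nodes, it can be convenient to reduce the kernel release string to the Enterprise Linux major version it was built for. The following is a minimal sketch; the function name el_major is hypothetical, and it assumes a RHEL or CentOS style kernel release that contains an ".elN" tag, as in the examples above.

```shell
# Read a kernel release string on stdin (for example
# "2.6.32-279.el6.x86_64") and print the Enterprise Linux major version
# taken from its ".elN" tag. Prints nothing if no such tag is present,
# which is the case on kernels outside the RHEL family.
el_major() {
    sed -n 's/.*\.el\([0-9][0-9]*\).*/\1/p'
}

# Usage:
#   uname -r | el_major
```

The result still has to be compared by hand against the list of platforms supported by your HDP release.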
Determine Installed Software
This information is helpful when troubleshooting performance-related issues, or when there is unexpected behavior occurring on one specific machine. One example would be a MapReduce job that suddenly starts running for 20 minutes rather than the expected 1 minute. The following command does not list any tarball-type installations, so you should keep in mind the possibility that some programs may have been installed outside of the system package manager.
Command:
rpm -qa
Example:
# rpm -qa | egrep "hadoop|yarn"
hadoop-hdfs-2.2.0.2.0.6.0-76.el6.x86_64
hadoop-mapreduce-2.2.0.2.0.6.0-76.el6.x86_64
hadoop-lzo-native-0.5.0-1.x86_64
hadoop-mapreduce-historyserver-2.2.0.2.0.6.0-76.el6.x86_64
hadoop-2.2.0.2.0.6.0-76.el6.x86_64
hadoop-lzo-0.5.0-1.x86_64
hadoop-yarn-2.2.0.2.0.6.0-76.el6.x86_64
hadoop-libhdfs-2.2.0.2.0.6.0-76.el6.x86_64
hadoop-yarn-resourcemanager-2.2.0.2.0.6.0-76.el6.x86_64
hadoop-client-2.2.0.2.0.6.0-76.el6.x86_64
hadoop-yarn-nodemanager-2.2.0.2.0.6.0-76.el6.x86_64
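When only one machine misbehaves, one possible approach is to save the package list on that node and on a healthy node, then compare the two. The sketch below assumes both lists were saved to files beforehand; the function name pkg_diff and the file names in the usage comment are hypothetical.

```shell
# Compare two saved package lists and print only the differences.
# Lines that appear only in the first list start at the left margin;
# lines that appear only in the second list are indented by one tab
# (this is the standard output layout of "comm -3").
pkg_diff() {
    a=$(mktemp) && b=$(mktemp) || return 1
    sort "$1" > "$a"      # comm requires sorted input
    sort "$2" > "$b"
    comm -3 "$a" "$b"
    rm -f "$a" "$b"
}

# Usage:
#   On each node:  rpm -qa > "$(hostname).pkgs"
#   Then compare:  pkg_diff good-node.pkgs bad-node.pkgs
```

As with rpm -qa itself, this comparison does not reveal software installed from tarballs.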
Detect Running Processes
This information is helpful when troubleshooting performance-related issues, or when there is unexpected behavior occurring on one specific machine.
Command:
ps -aux
Example:
$ ps -aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.0  19348   620 ?        Ss   Sep25   0:06 /sbin/init
postgres  6705  0.0  0.0 214952  2936 ?        Ss   09:18   0:00 postgres: mapred ambarirca 10.10.3.27(60031) idle
root         3  0.0  0.0      0     0 ?        S    Sep25   0:00 [migration/0]
root         4  0.0  0.0      0     0 ?        S    Sep25   0:07 [ksoftirqd/0]
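For performance problems, the full process list is often easier to read once it is reduced to the heaviest CPU consumers. The following sketch assumes the column layout shown above, where %CPU is the third column; the function name top_cpu is hypothetical.

```shell
# Read "ps aux" output on stdin and print the N processes with the
# highest %CPU value (column 3). N defaults to 5.
top_cpu() {
    n=${1:-5}
    # Drop the header line, then sort numerically on column 3, descending.
    sed 1d | sort -k3,3nr | head -n "$n"
}

# Usage:
#   ps aux | top_cpu 10
```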
Detect Java Running Processes
The command below lists the Java processes that are running on the machine. Since most Hadoop code is based on Java, using this command can also help verify that the Hadoop processes are running on a specific machine.
Command:
jps
Example output:
10528 ResourceManager
25185 Jps
9202 RunJar
10141 Bootstrap
8001 QuorumPeerMain
7357 NameNode
8358 HMaster
12474 HRegionServer
9605 RunJar
1921 NodeManager
8857 JobHistoryServer
5612 DataNode
17667 RunJar
2943 AmbariServer
11103 SecondaryNameNode
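To verify that a particular set of daemons is running on a node, the jps output can be checked against a list of expected class names. This is a minimal sketch; the function name missing_daemons is hypothetical, and the names you pass depend on which components are assigned to the node. Note that jps generally reports only the JVMs the invoking user is allowed to inspect, so it may need to be run as root or as the service user.

```shell
# Read jps output on stdin ("<pid> <class name>" per line) and print
# each expected class name, given as an argument, that is not listed.
missing_daemons() {
    running=$(awk '{print $2}')
    for name in "$@"; do
        # -x: match the whole line, -F: treat the name as a fixed string
        printf '%s\n' "$running" | grep -qxF "$name" || printf '%s\n' "$name"
    done
}

# Usage:
#   jps | missing_daemons NameNode DataNode
```

No output means every expected daemon was found.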
Show Open Files Linked to a Process ID
This command lists the files that a given process has open. It is helpful when an error states that a file is locked and a process cannot start because it cannot write to that file: checking the open files of a suspect process shows whether it is the one holding the file.
Command:
lsof -p <pid> | grep <file string name>
Example:
$ lsof -p 8857 | grep var
java 8857 mapred   1w REG 253,0  2031 542470 /var/log/hadoop-mapreduce/mapred/mapred-mapred-historyserver-sandbox.out
java 8857 mapred   2w REG 253,0  2031 542470 /var/log/hadoop-mapreduce/mapred/mapred-mapred-historyserver-sandbox.out
java 8857 mapred 159w REG 253,0 95452 542286 /var/log/hadoop-mapreduce/mapred/mapred-mapred-historyserver-sandbox.log
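On a Linux node where lsof is not installed, similar information can be read from the /proc file system, because each entry under /proc/<pid>/fd is a symbolic link to an open file. The following is a sketch of that fallback, not a replacement for lsof; the function name open_files_matching is hypothetical, and reading another user's process usually requires root.

```shell
# Print "<fd number> <path>" for every open file of process $1 whose
# path contains the string $2. Linux only: relies on /proc/<pid>/fd.
open_files_matching() {
    pid=$1
    pattern=$2
    for fd in /proc/"$pid"/fd/*; do
        # Skip entries that cannot be resolved (no permission, fd closed).
        target=$(readlink "$fd" 2>/dev/null) || continue
        case $target in
            *"$pattern"*) printf '%s %s\n' "${fd##*/}" "$target" ;;
        esac
    done
}

# Usage:
#   open_files_matching 8857 /var/log
```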
Verifying Well-formed XML
The following command can help determine if a configuration file in XML format is well-formed. If the XML file is well-formed, xmllint prints the parsed document to standard output. If there are problems with the file, a list of parser errors is displayed instead. This command can help uncover syntax errors introduced by manually editing Hadoop configuration files. The example below shows the list of errors returned when an XML file is not well-formed.
Command:
xmllint <xml file>
Example:
$ xmllint ./hdfs-site.xml
./hdfs-site.xml:187: parser error : Opening and ending tag mismatch: property line 6 and configuration
</configuration>
^
./hdfs-site.xml:188: parser error : Premature end of data in tag property line 3
^
./hdfs-site.xml:188: parser error : Premature end of data in tag configuration line 2
^
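To check several configuration files in one pass, xmllint can be run with the --noout option, which suppresses the document output so that only the exit status and any errors remain. The sketch below reports just the files that fail; the function name check_xml is hypothetical, and the /etc/hadoop/conf path in the usage comment is an assumption about where the configuration files live on your node.

```shell
# Check each file given as an argument and print the ones that are not
# well-formed. Returns 0 if all files pass, 1 if any file fails, and 2
# if xmllint is not installed.
check_xml() {
    command -v xmllint >/dev/null 2>&1 || {
        echo "xmllint not found" >&2
        return 2
    }
    status=0
    for f in "$@"; do
        if ! xmllint --noout "$f" 2>/dev/null; then
            printf 'NOT WELL-FORMED: %s\n' "$f"
            status=1
        fi
    done
    return $status
}

# Usage (directory is an assumption):
#   check_xml /etc/hadoop/conf/*.xml
```

Run xmllint directly on any file that is reported to see the detailed parser errors.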
Detect Auto-start Processes
Ideally, the cluster administrator should have this information. Information about auto-start processes can help in determining why a certain behavior is specific to a machine. For instance, a process that auto-starts at boot may prevent one of the HDP components from launching after a node is rebooted because of a port conflict. The command below returns the cron jobs of every user on the machine.
Command:
for user in $(cut -f1 -d: /etc/passwd); do crontab -u $user -l; done
Example:
$ for user in $(cut -f1 -d: /etc/passwd); do crontab -u $user -l; done
no crontab for root
no crontab for bin
no crontab for daemon
no crontab for adm
no crontab for lp
no crontab for sync
no crontab for shutdown
no crontab for halt
no crontab for mail
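On a machine with many accounts, most of the output consists of "no crontab for" messages. The following sketch is one way to print only the users that actually have cron jobs, with a header per user. The function names passwd_users and list_crontabs are hypothetical, and reading other users' crontabs requires root.

```shell
# Print the login names (first field) from a passwd-format file.
# Defaults to /etc/passwd.
passwd_users() {
    cut -d: -f1 "${1:-/etc/passwd}"
}

# Print the crontab of every user that has one, preceded by a header
# line. Users without a crontab are skipped silently.
list_crontabs() {
    for user in $(passwd_users "$1"); do
        entries=$(crontab -u "$user" -l 2>/dev/null) || continue
        if [ -n "$entries" ]; then
            printf '== %s ==\n%s\n' "$user" "$entries"
        fi
    done
}

# Usage (as root):
#   list_crontabs
```

Per-user crontabs are only one source of automatically started processes. System-wide entries in /etc/crontab and /etc/cron.d, and on CentOS or RHEL 6 the services reported by chkconfig --list, may also be worth reviewing.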
Get a List of All Mounts on a Machine
Gathering this information will help in determining what drives are actually mounted or available for use on the node. The following commands are helpful in determining if the system is hitting any storage limitations. Unexpected behavior can occur when disks are full.
Command:
df
Example:
$ df
Filesystem                   1K-blocks    Used Available Use% Mounted on
/dev/mapper/VolGroup-lv_root  11272464 4729432   5970412  45% /
tmpfs                           961928     272    961656   1% /dev/shm
/dev/sda1                       495844   37433    432811   8% /boot
Command:
cat /etc/fstab
Example:
root@a2nn:~> cat /etc/fstab
#
# /etc/fstab
# Created by anaconda on Wed Mar 20 15:03:22 2013
#
# Accessible filesystems, by reference, are maintained under '/dev/disk'
# See man pages fstab(5), findfs(8), mount(8) and/or blkid(8) for more info
#
/dev/mapper/vg_a2nn-lv_root               /        ext4   defaults        1 1
UUID=8bbdbae7-9cb8-4b66-af1c-4f904f047501 /boot    ext4   defaults        1 2
/dev/mapper/vg_a2nn-lv_swap               swap     swap   defaults        0 0
tmpfs                                     /dev/shm tmpfs  defaults        0 0
devpts                                    /dev/pts devpts gid=5,mode=620  0 0
sysfs                                     /sys     sysfs  defaults        0 0
proc                                      /proc    proc   defaults        0 0
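Because full disks are a common cause of unexpected behavior, it can help to flag any file system above a usage threshold rather than reading the df table by eye. The sketch below assumes the POSIX output format produced by df -P, which keeps each file system on a single line; the function name full_mounts and the default threshold of 90 percent are illustrative choices.

```shell
# Read "df -P" output on stdin and print the mount point and usage of
# every file system at or above the given percentage (default 90).
# Mount points that contain spaces are not handled by this sketch.
full_mounts() {
    awk -v limit="${1:-90}" 'NR > 1 {
        use = $5
        sub(/%/, "", use)           # "45%" -> "45"
        if (use + 0 >= limit + 0) print $6, $5
    }'
}

# Usage:
#   df -P | full_mounts 80
```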
Operating System Log Files
These files can help in determining if a machine was rebooted or shut down at a particular time. They can also help determine why some HDP services were not operational at a specific time.
/var/log/messages
Contains global system messages, including messages that are logged during system start-up.
/var/log/audit/audit.log
This file is only available if the /etc/init.d/auditd daemon has been started. To check status, execute /etc/init.d/auditd status. This file can be used to check which user executed a command at a particular time.
Hardware Information
The following commands provide information about the hardware components installed on the machine. This will help in isolating issues related to hardware.
The following command lists information about the PCI buses and devices in the system:
lspci
Example:
$ lspci
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.1 IDE interface: Intel Corporation 82371AB/EB/MB PIIX4 IDE (rev 01)
00:02.0 VGA compatible controller: InnoTek Systemberatung GmbH VirtualBox Graphics Adapter
00:03.0 Ethernet controller: Advanced Micro Devices [AMD] 79c970 [PCnet32 LANCE] (rev 10)
00:04.0 System peripheral: InnoTek Systemberatung GmbH VirtualBox Guest Service
00:05.0 Multimedia audio controller: Intel Corporation 82801AA AC'97 Audio Controller (rev 01)
00:06.0 USB controller: Apple Computer Inc. KeyLargo/Intrepid USB
00:07.0 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
00:08.0 Ethernet controller: Advanced Micro Devices [AMD] 79c970 [PCnet32 LANCE] (rev 10)
00:0d.0 SATA controller: Intel Corporation 82801HM/HEM (ICH8M/ICH8M-E) SATA Controller [AHCI mode] (rev 02)
The following command returns CPU information for the machine:
cat /proc/cpuinfo
Example:
$ cat /proc/cpuinfo
processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model       : 58
model name  : Intel(R) Core(TM) i7-3615QM CPU @ 2.30GHz
stepping    : 9
cpu MHz     : 2283.256
cache size  : 6144 KB
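Because /proc/cpuinfo repeats one block per logical processor, a short summary is often enough when comparing nodes. The following is a minimal sketch; the function names cpu_count and cpu_models are hypothetical, and both accept an optional file argument so they can also be run against a saved copy of the file.

```shell
# Print the number of logical processors listed in a cpuinfo file.
cpu_count() {
    grep -c '^processor' "${1:-/proc/cpuinfo}"
}

# Print each distinct CPU model name with the number of logical
# processors that report it.
cpu_models() {
    grep '^model name' "${1:-/proc/cpuinfo}" |
        cut -d: -f2- | sed 's/^ *//' | sort | uniq -c
}

# Usage:
#   cpu_count
#   cpu_models
```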
Network Information
This set of commands can be helpful when troubleshooting network issues.
The following command displays the configured network interfaces. Use it to verify that the interfaces are up and that each one has an IP address assigned. When dealing with issues involving communication between nodes, this is a good place to start.
ifconfig
Example:
$ ifconfig
eth0      Link encap:Ethernet  HWaddr 08:00:27:76:CD:33
          inet addr:10.10.3.27  Bcast:10.10.3.255  Mask:255.255.254.0
          inet6 addr: fe80::a00:27ff:fe76:cd33/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:1217765 errors:0 dropped:0 overruns:0 frame:0
          TX packets:336245 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:308949876 (294.6 MiB)  TX bytes:128725650 (122.7 MiB)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:1609854 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1609854 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:619945138 (591.2 MiB)  TX bytes:619945138 (591.2 MiB)

virbr0    Link encap:Ethernet  HWaddr 52:54:00:EB:5E:B7
          inet addr:192.168.122.1  Bcast:192.168.122.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:175 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 b)  TX bytes:10172 (9.9 KiB)
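To confirm quickly that every interface has an address, the ifconfig output can be reduced to one line per interface. The sketch below assumes the "inet addr:" layout shown in the example above; newer ifconfig versions print "inet" without "addr:", which this sketch does not handle. The function name ipv4_addrs is hypothetical.

```shell
# Read ifconfig output on stdin and print "<interface> <IPv4 address>"
# for every interface that has an IPv4 address assigned.
ipv4_addrs() {
    awk '
        /^[^ \t]/ { iface = $1 }        # unindented line starts an interface block
        /inet addr:/ {
            split($2, parts, ":")       # $2 looks like "addr:10.10.3.27"
            print iface, parts[2]
        }'
}

# Usage:
#   ifconfig | ipv4_addrs
```

An interface that is missing from the result has no IPv4 address configured.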
The following command returns a list of the ports used within the system. This is helpful in determining if a specific port is already in use, and therefore another application cannot bind and listen on the port. This command can be useful when trying to isolate why certain HDP Master processes are not able to start up.
netstat -an
Example:
$ netstat -an
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address               Foreign Address             State
tcp        0      0 10.10.10.157:51111          0.0.0.0:*                   LISTEN
tcp        0      0 127.0.0.1:199               0.0.0.0:*                   LISTEN
tcp        0      0 10.10.10.157:50090          0.0.0.0:*                   LISTEN
tcp        0      0 0.0.0.0:8010                0.0.0.0:*                   LISTEN
tcp        0      0 0.0.0.0:8651                0.0.0.0:*                   LISTEN
tcp        0      0 0.0.0.0:3306                0.0.0.0:*                   LISTEN
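When a master process fails to start, the usual question is whether one specific port is already taken. The following sketch answers that from the netstat output, assuming the column layout shown above (local address in the fourth column, state in the last). The function name port_listening is hypothetical.

```shell
# Read "netstat -an" output on stdin. Succeeds (exit status 0) if a TCP
# socket is in the LISTEN state on the given port, fails otherwise.
port_listening() {
    awk -v port="$1" '
        $1 ~ /^tcp/ && $NF == "LISTEN" {
            n = split($4, a, ":")       # port is the last ":"-separated part
            if (a[n] == port) found = 1
        }
        END { exit (found ? 0 : 1) }'
}

# Usage:
#   netstat -an | port_listening 8010 && echo "port 8010 is in use"
```

To see which process owns the port, netstat -anp adds a PID/program column when run as root.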
You can use the following command to start, stop, and check the status of the firewall service on a CentOS or RHEL operating system. The firewall service status represents another possible reason why communication cannot be established between nodes.
service iptables [stop | status | start]
Example:
$ service iptables status
iptables: Firewall is not running.