Resource Manager or Node Manager: Fails to Start or Crashes
Symptoms may include:
Process appears to start, but then disappears from the process list.
Node Manager cannot bind to interface.
Kernel panic or system halt.
- Potential Root Cause: Existing Process Bound to Port
Troubleshooting Steps:
Examine the bound ports to verify that no other process has already bound to the port the Resource Manager or Node Manager needs (see the example below).
Resolution Steps:
Resolve the port conflict before attempting to restart the Resource Manager/Node Manager.
Information to Collect:
List of bound interfaces/ports and the process.
Resource Manager log.
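For example, a quick port check from the command line might look like the following (8032 and 8088 are common Resource Manager defaults; substitute the ports configured in your yarn-site.xml):
  # Show any process already listening on the Resource Manager ports.
  netstat -tlnp | grep -E ':(8032|8088)'
  # Equivalent check on systems that ship ss instead of netstat.
  ss -tlnp | grep -E ':(8032|8088)'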
- Potential Root Cause: Incorrect File Permissions
Troubleshooting Steps:
Verify that all Hadoop file system permissions are set properly.
Verify the Hadoop configurations.
Resolution Steps:
Follow the procedures for handling failure due to file permissions (see Hortonworks KB Solutions/Articles).
Fix any incorrect configuration.
Information to Collect:
Dump of file system permissions, ownership, and flags for the directories named by the yarn.nodemanager.local-dirs property in the yarn-site.xml file. In this case, the property has a value of “/hadoop/yarn/local”, so from the command line, run: ls -lR /hadoop/yarn/local
Resource Manager log.
Node Manager log.
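A minimal sketch of collecting the permissions dump described above, assuming the configuration lives in /etc/hadoop/conf and the property name and value sit on adjacent lines of yarn-site.xml:
  # Look up the configured local directories.
  grep -A1 'yarn.nodemanager.local-dirs' /etc/hadoop/conf/yarn-site.xml
  # Dump permissions, ownership, and flags for the directory found above.
  ls -lR /hadoop/yarn/local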
- Potential Root Cause: Incorrect Name-to-IP Resolution
Troubleshooting Steps:
Verify that the name/IP resolution is correct for all nodes in the cluster.
Resolution Steps:
Fix any incorrect configuration.
Information to Collect:
Local hosts file for all hosts on the system (/etc/hosts).
Resolver configuration (/etc/resolv.conf).
Network configuration (/etc/sysconfig/network-scripts/ifcfg-ethX, where X is the number of the interface card).
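The following commands are one way to spot-check resolution on a node (the hostnames are placeholders for hosts in your cluster):
  # Confirm this node's fully qualified hostname.
  hostname -f
  # Confirm forward resolution for other cluster hosts.
  getent hosts master1.example.com
  getent hosts worker1.example.com
  # Review the hosts file and resolver configuration directly.
  cat /etc/hosts /etc/resolv.conf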
- Potential Root Cause: Java Heap Space Too Low
Troubleshooting Steps:
Examine the heap space property in yarn-env.sh (see the example below).
Examine the settings in Ambari cluster management.
Resolution Steps:
Adjust the heap space property until the Resource Manager resumes running.
Information to Collect:
yarn-env.sh from the cluster.
Screenshot of the Ambari cluster management mapred settings screen.
Resource Manager log.
Node Manager log.
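The heap sizes are commonly set through environment variables in yarn-env.sh; a sketch follows, with values that are purely illustrative rather than recommendations:
  # In yarn-env.sh; heap sizes are given in MB.
  export YARN_RESOURCEMANAGER_HEAPSIZE=2048
  export YARN_NODEMANAGER_HEAPSIZE=1024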
- Potential Root Cause: Permissions Not Set Correctly on Local File System
Troubleshooting Steps:
Examine the permissions on the various directories on the local file system.
Verify proper ownership (yarn/mapred for MapReduce directories and hdfs for HDFS directories).
Resolution Steps:
Use the chmod command to change the permissions of the directories to 755.
Use the chown command to assign the directories to the correct owner (hdfs or yarn/mapred), as shown in the example below.
Relaunch the Hadoop daemons using the correct user.
Information to Collect:
core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml.
Permissions listing for the directories listed in the above configuration files.
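A minimal sketch, assuming the example local directory used earlier and a yarn:hadoop owner (adjust paths, user, and group to match your installation):
  # Set directory permissions to 755.
  chmod -R 755 /hadoop/yarn/local
  # Assign ownership to the YARN service user.
  chown -R yarn:hadoop /hadoop/yarn/local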
- Potential Root Cause: Insufficient Disk Space
Troubleshooting Steps:
Verify that there is sufficient space on all system, log, and HDFS partitions.
Run the df -k command on the Name/DataNodes to verify that there is sufficient capacity on the disk volumes used for storing NameNode or HDFS data.
Resolution Steps:
Free up disk space on all nodes in the cluster.
-OR-
Add additional capacity.
Information to Collect:
Core dumps.
Linux command: last (history).
Dump of file system information.
Output of the df -k command.
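For example (the mount points are illustrative):
  # Check capacity on the partitions holding logs and HDFS data.
  df -k /var/log /hadoop/hdfs/data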
- Potential Root Cause: Reserved Disk Space is Set Higher than Free Space
Troubleshooting Steps:
In hdfs-site.xml, check that the value of the dfs.datanode.du.reserved property is less than the available free space on the drive, as shown in the example below.
Resolution Steps:
Configure an appropriate value, or increase free space.
Information to Collect:
HDFS configuration files.
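One way to compare the two values from the command line, assuming the configuration lives in /etc/hadoop/conf and the data volume is mounted at /hadoop/hdfs/data (dfs.datanode.du.reserved is expressed in bytes):
  # Show the configured reserved space.
  grep -A1 'dfs.datanode.du.reserved' /etc/hadoop/conf/hdfs-site.xml
  # Show free space on the DataNode data volume for comparison.
  df -k /hadoop/hdfs/data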
Node Java Process Exited Abnormally
- Potential Root Cause: Improper Shutdown
Troubleshooting Steps:
Investigate the OS history and the Hadoop audit logs.
Verify that no edit log or fsimage corruption occurred.
Resolution Steps:
Investigate the cause, and take measures to prevent future occurrence.
Information to Collect:
Hadoop audit logs.
Linux command: last (history).
Linux user command history.
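A starting point for gathering this history (the audit log path is an assumption and varies by installation):
  # Show recent shutdowns, reboots, and logins.
  last -x shutdown reboot
  last
  # Review Hadoop audit activity around the time of the failure.
  less /var/log/hadoop/hdfs/hdfs-audit.log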
- Potential Root Cause: Incorrect Memory Configuration
Troubleshooting Steps:
Verify values in configuration files.
Check logs for stack traces -- out of heap space, or similar.
Resolution Steps:
Fix configuration values and restart job/Resource Manager/Node Manager.
Information to Collect:
Resource Manager log.
Node Manager log.
MapReduce v2 configuration files.
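For example, the daemon logs can be scanned for heap-related stack traces (the log directory is an assumption; use the directory configured for your installation):
  # Look for out-of-memory errors in the YARN daemon logs.
  grep -iE 'OutOfMemoryError|heap space' /var/log/hadoop-yarn/yarn/*.log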
Node Manager Denied Communication with Resource Manager
- Potential Root Cause: Hostname in Exclude File or Does Not Exist in Include File
Troubleshooting Steps:
Verify the contents of the files referenced in the yarn.resourcemanager.nodes.exclude-path property or the yarn.resourcemanager.nodes.include-path property (see the example below).
Verify that the host for this Node Manager is not being decommissioned.
Resolution Steps:
If the hostname for the Node Manager is in the exclude file and the node is not meant to be decommissioned, remove it.
Information to Collect:
Files that are pointed to by the yarn.resourcemanager.nodes.exclude-path or yarn.resourcemanager.nodes.include-path properties.
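A minimal check, assuming the configuration lives in /etc/hadoop/conf (the include/exclude file names shown are examples; use the paths from your own yarn-site.xml):
  # Find which files the include/exclude properties point to.
  grep -A1 'yarn.resourcemanager.nodes' /etc/hadoop/conf/yarn-site.xml
  # Check whether this Node Manager's host appears in those files.
  grep -H "$(hostname -f)" /etc/hadoop/conf/yarn.exclude /etc/hadoop/conf/yarn.include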
- Potential Root Cause: Node was Decommissioned and/or Reinserted into the Cluster
Troubleshooting Steps:
This is a problem with HDFS. Refer to the HDFS Troubleshooting Guide.
- Potential Root Cause: Resource Manager is Refusing the Connection
Troubleshooting Steps:
Follow the steps (described previously) to ensure that the Resource Manager has started and is accepting requests (see the example below).
Resolution Steps:
Ensure that the Node Manager host is not in the "exclude" list.
Ensure that the Node Manager host is in the "include" list.
Information to Collect:
Files that are pointed to by the yarn.resourcemanager.nodes.exclude-path or yarn.resourcemanager.nodes.include-path properties.
YARN configuration (yarn-site.xml).
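One quick way to confirm that the Resource Manager is up and accepting requests (the hostname is a placeholder, and 8088 is the common web UI default):
  # List the nodes currently registered with the Resource Manager.
  yarn node -list
  # Confirm the Resource Manager web UI responds.
  curl -s http://resourcemanager.example.com:8088/cluster | head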
- Potential Root Cause: NameNode is Refusing the Connection
Troubleshooting Steps:
Follow the steps (described previously) to ensure that the NameNode has started and is accepting requests.
Resolution Steps:
Ensure that the DataNode is not in the "exclude" list.
Information to Collect:
NameNode slaves file.
NameNode hosts.deny file (or the file specified as the blacklist in the HDFS configuration).
NameNode hosts.allow file (or the file specified as the whitelist in the HDFS configuration).
HDFS configuration.
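To check whether the DataNode's host appears in the blacklist and whether the NameNode currently sees it (the exclude file path is a placeholder):
  # Search the blacklist for this DataNode's hostname.
  grep "$(hostname -f)" /etc/hadoop/conf/dfs.exclude
  # List the DataNodes the NameNode currently reports.
  hdfs dfsadmin -report | grep 'Name:'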
Error: Could Only be Replicated to x Nodes, Instead of n
- Potential Root Cause: At Least One DataNode is Nonfunctional
Troubleshooting Steps:
This is a problem with HDFS. Refer to the HDFS Troubleshooting Guide.
- Potential Root Cause: One or More DataNodes Are Out of Space on Their Currently Available Disk Drives
Troubleshooting Steps:
This is a problem with HDFS. Refer to the HDFS Troubleshooting Guide.