Resource Manager or Node Manager: Fails to Start or Crashes
Symptoms may include:
Process appears to start, but then disappears from the process list.
Node Manager cannot bind to interface.
Kernel panic or system halt.
- Potential Root Cause: Existing Process Bound to Port
Troubleshooting Steps:
Examine the bound ports to verify that no other process has already bound to the port the Resource Manager or Node Manager needs (see the example below).
Resolution Steps:
Resolve the port conflict before attempting to restart the Resource Manager/Node Manager.
Information to Collect:
List of bound interfaces/ports and the process.
Resource Manager log.
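For example, a quick port check from the command line might look like the following (8032 and 8088 are common Resource Manager defaults; substitute the ports configured in your yarn-site.xml):
  # Show any process already listening on the Resource Manager ports.
  netstat -tlnp | grep -E ':(8032|8088)'
  # Equivalent check on systems that ship ss instead of netstat.
  ss -tlnp | grep -E ':(8032|8088)'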
- Potential Root Cause: Incorrect File Permissions
Troubleshooting Steps:
Verify that all Hadoop file system permissions are set properly.
Verify the Hadoop configurations.
Resolution Steps:
Follow the procedures for handling failure due to file permissions (see Hortonworks KB Solutions/Articles).
Fix any incorrect configuration.
Information to Collect:
Dump of file system permissions, ownership, and flags for the directories named by the yarn.nodemanager.local-dirs property in the yarn-site.xml file. In this case, the property has a value of “/hadoop/yarn/local”, so from the command line, run: ls -lR /hadoop/yarn/local
Resource Manager log.
Node Manager log.
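A minimal sketch of collecting the permissions dump described above, assuming the configuration lives in /etc/hadoop/conf and the property name and value sit on adjacent lines of yarn-site.xml:
  # Look up the configured local directories.
  grep -A1 'yarn.nodemanager.local-dirs' /etc/hadoop/conf/yarn-site.xml
  # Dump permissions, ownership, and flags for the directory found above.
  ls -lR /hadoop/yarn/local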
- Potential Root Cause: Incorrect Name-to-IP Resolution
Troubleshooting Steps:
Verify that the name/IP resolution is correct for all nodes in the cluster.
Resolution Steps:
Fix any incorrect configuration.
Information to Collect:
Local hosts file for all hosts on the system (/etc/hosts).
Resolver configuration (/etc/resolv.conf).
Network configuration (/etc/sysconfig/network-scripts/ifcfg-ethX, where X is the number of the interface card).
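The following commands are one way to spot-check resolution on a node (the hostnames are placeholders for hosts in your cluster):
  # Confirm this node's fully qualified hostname.
  hostname -f
  # Confirm forward resolution for other cluster hosts.
  getent hosts master1.example.com
  getent hosts worker1.example.com
  # Review the hosts file and resolver configuration directly.
  cat /etc/hosts /etc/resolv.conf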
- Potential Root Cause: Java Heap Space Too Low
Troubleshooting Steps:
Examine the heap space property in yarn-env.sh (see the example below).
Examine the settings in Ambari cluster management.
Resolution Steps:
Adjust the heap space property until the Resource Manager resumes running.
Information to Collect:
yarn-env.sh from the cluster.
Screenshot of the Ambari cluster management mapred settings screen.
Resource Manager log.
Node Manager log.
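The heap sizes are commonly set through environment variables in yarn-env.sh; a sketch follows, with values that are purely illustrative rather than recommendations:
  # In yarn-env.sh; heap sizes are given in MB.
  export YARN_RESOURCEMANAGER_HEAPSIZE=2048
  export YARN_NODEMANAGER_HEAPSIZE=1024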
- Potential Root Cause: Permissions Not Set Correctly on Local File System
Troubleshooting Steps:
Examine the permissions on the various directories on the local file system.
Verify proper ownership (yarn/mapred for MapReduce directories and hdfs for HDFS directories).
Resolution Steps:
Use the chmod command to change the permissions of the directories to 755.
Use the chown command to assign the directories to the correct owner (hdfs or yarn/mapred), as shown in the example below.
Relaunch the Hadoop daemons using the correct user.
Information to Collect:
core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml.
Permissions listing for the directories listed in the above configuration files.
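A minimal sketch, assuming the example local directory used earlier and a yarn:hadoop owner (adjust paths, user, and group to match your installation):
  # Set directory permissions to 755.
  chmod -R 755 /hadoop/yarn/local
  # Assign ownership to the YARN service user.
  chown -R yarn:hadoop /hadoop/yarn/local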
- Potential Root Cause: Insufficient Disk Space
Troubleshooting Steps:
Verify that there is sufficient space on all system, log, and HDFS partitions.
Run the df -k command on the Name/DataNodes to verify that there is sufficient capacity on the disk volumes used for storing NameNode or HDFS data.
Resolution Steps:
Free up disk space on all nodes in the cluster.
-OR-
Add additional capacity.
Information to Collect:
Core dumps.
Linux command: last (history).
Dump of file system information.
Output of the df -k command.
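For example (the mount points are illustrative):
  # Check capacity on the partitions holding logs and HDFS data.
  df -k /var/log /hadoop/hdfs/data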
- Potential Root Cause: Reserved Disk Space is Set Higher than Free Space
Troubleshooting Steps:
In hdfs-site.xml, check that the value of the dfs.datanode.du.reserved property is less than the available free space on the drive, as shown in the example below.
Resolution Steps:
Configure an appropriate value, or increase free space.
Information to Collect:
HDFS configuration files.
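One way to compare the two values from the command line, assuming the configuration lives in /etc/hadoop/conf and the data volume is mounted at /hadoop/hdfs/data (dfs.datanode.du.reserved is expressed in bytes):
  # Show the configured reserved space.
  grep -A1 'dfs.datanode.du.reserved' /etc/hadoop/conf/hdfs-site.xml
  # Show free space on the DataNode data volume for comparison.
  df -k /hadoop/hdfs/data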
Node Java Process Exited Abnormally
- Potential Root Cause: Improper Shutdown
Troubleshooting Steps:
Investigate the OS history and the Hadoop audit logs.
Verify that no edit log or fsimage corruption occurred.
Resolution Steps:
Investigate the cause, and take measures to prevent future occurrence.
Information to Collect:
Hadoop audit logs.
Linux command: last (history).
Linux user command history.
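A starting point for gathering this history (the audit log path is an assumption and varies by installation):
  # Show recent shutdowns, reboots, and logins.
  last -x shutdown reboot
  last
  # Review Hadoop audit activity around the time of the failure.
  less /var/log/hadoop/hdfs/hdfs-audit.log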
- Potential Root Cause: Incorrect Memory Configuration
Troubleshooting Steps:
Verify values in configuration files.
Check logs for stack traces -- out of heap space, or similar.
Resolution Steps:
Fix configuration values and restart job/Resource Manager/Node Manager.
Information to Collect:
Resource Manager log.
Node Manager log.
MapReduce v2 configuration files.
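For example, the daemon logs can be scanned for heap-related stack traces (the log directory is an assumption; use the directory configured for your installation):
  # Look for out-of-memory errors in the YARN daemon logs.
  grep -iE 'OutOfMemoryError|heap space' /var/log/hadoop-yarn/yarn/*.log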
Node Manager Denied Communication with Resource Manager
- Potential Root Cause: Hostname in Exclude File or Does Not Exist in Include File
Troubleshooting Steps:
Verify the contents of the files referenced in the yarn.resourcemanager.nodes.exclude-path property or the yarn.resourcemanager.nodes.include-path property (see the example below).
Verify that the host for this Node Manager is not being decommissioned.
Resolution Steps:
If the hostname for the Node Manager is in the exclude file and the node is not meant to be decommissioned, remove it.
Information to Collect:
Files that are pointed to by the yarn.resourcemanager.nodes.exclude-path or yarn.resourcemanager.nodes.include-path properties.
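A minimal check, assuming the configuration lives in /etc/hadoop/conf (the include/exclude file names shown are examples; use the paths from your own yarn-site.xml):
  # Find which files the include/exclude properties point to.
  grep -A1 'yarn.resourcemanager.nodes' /etc/hadoop/conf/yarn-site.xml
  # Check whether this Node Manager's host appears in those files.
  grep -H "$(hostname -f)" /etc/hadoop/conf/yarn.exclude /etc/hadoop/conf/yarn.include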
- Potential Root Cause: Node was Decommissioned and/or Reinserted into the Cluster
Troubleshooting Steps:
This is a problem with HDFS. Refer to the HDFS Troubleshooting Guide.
- Potential Root Cause: Resource Manager is Refusing the Connection
Troubleshooting Steps:
Follow the steps (described previously) to ensure that the Resource Manager has started and is accepting requests (see the example below).
Resolution Steps:
Ensure that the Node Manager host is not in the "exclude" list.
Ensure that the Node Manager host is in the "include" list.
Information to Collect:
Files that are pointed to by the yarn.resourcemanager.nodes.exclude-path or yarn.resourcemanager.nodes.include-path properties.
YARN configuration (yarn-site.xml).
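One quick way to confirm that the Resource Manager is up and accepting requests (the hostname is a placeholder, and 8088 is the common web UI default):
  # List the nodes currently registered with the Resource Manager.
  yarn node -list
  # Confirm the Resource Manager web UI responds.
  curl -s http://resourcemanager.example.com:8088/cluster | head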
- Potential Root Cause: NameNode is Refusing the Connection
Troubleshooting Steps:
Follow the steps (described previously) to ensure that the NameNode has started and is accepting requests.
Resolution Steps:
Ensure that the DataNode is not in the "exclude" list.
Information to Collect:
NameNode slaves file.
NameNode hosts.deny file (or the file specified as the blacklist in the HDFS configuration).
NameNode hosts.allow file (or the file specified as the whitelist in the HDFS configuration).
HDFS configuration.
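To check whether the DataNode's host appears in the blacklist and whether the NameNode currently sees it (the exclude file path is a placeholder):
  # Search the blacklist for this DataNode's hostname.
  grep "$(hostname -f)" /etc/hadoop/conf/dfs.exclude
  # List the DataNodes the NameNode currently reports.
  hdfs dfsadmin -report | grep 'Name:'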
Error: Could Only be Replicated to x Nodes, Instead of n
- Potential Root Cause: At Least One DataNode is Nonfunctional
Troubleshooting Steps:
This is a problem with HDFS. Refer to the HDFS Troubleshooting Guide.
- Potential Root Cause: One or More DataNodes Are Out of Space on Their Currently Available Disk Drives
Troubleshooting Steps:
This is a problem with HDFS. Refer to the HDFS Troubleshooting Guide.