3.2.10. Common Server-Side Issues

Resource Manager or Node Manager Fails to Start or Crashes

Symptoms may include:

  • Process appears to start, but then disappears from the process list.

  • Node Manager cannot bind to the network interface.

  • Kernel panic or system halt.

Potential Root Cause:  Existing Process Bound to Port

Troubleshooting Steps:

  • Examine the bound ports to verify that no other process has already bound to the Resource Manager or Node Manager port (see the example below).
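
For example, to check whether another process is already bound to a Resource Manager port (8088, the usual web UI default, is used here as an assumption; substitute the port reported in the log):

    netstat -tlnp | grep 8088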

Resolution Steps:

  • Resolve the port conflict before attempting to restart the Resource Manager/Node Manager.

Information to Collect:

  • List of bound interfaces/ports and the processes bound to them.

  • Resource Manager log.

Potential Root Cause:  Incorrect File Permissions

Troubleshooting Steps:

  • Verify that all Hadoop file system permissions are set properly (see the example after this list).

  • Verify the Hadoop configurations.
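
A quick spot check of HDFS-level permissions and ownership (the paths below are examples; adjust for your cluster):

    hdfs dfs -ls /
    hdfs dfs -ls /user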

Resolution Steps:

  • Follow the procedures for handling failure due to file permissions (see Hortonworks KB Solutions/Articles).

  • Fix any incorrect configuration.

Information to Collect:

  • Dump of file system permissions, ownership, and flags for the Node Manager local directories. To find these directories, look up the value of the yarn.nodemanager.local-dirs property in the yarn-site.xml file. For example, if it has a value of “/hadoop/yarn/local”, run the following from the command line:

    ls -lR /hadoop/yarn/local

  • Resource Manager log.

  • Node Manager log.

Potential Root Cause:  Incorrect Name-to-IP Resolution

Troubleshooting Steps:

  • Verify that name/IP resolution is correct for all nodes in the cluster (see the example after this list).
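
As a starting point, confirm on each node that the fully qualified hostname resolves to the expected address and that the forward and reverse lookups agree (plain OS commands; no Hadoop-specific tooling is assumed):

    hostname -f
    getent hosts $(hostname -f)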

Resolution Steps:

  • Fix any incorrect configuration.

Information to Collect:

  • Local hosts file for all hosts on the system (/etc/hosts).

  • Resolver configuration (/etc/resolv.conf).

  • Network configuration (/etc/sysconfig/network-scripts/ifcfg-ethX where X = number of interface card).

Potential Root Cause:  Java Heap Space Too Low

Troubleshooting Steps:

  • Examine the heap space property in yarn-env.sh (see the sketch after this list).

  • Examine the settings in Ambari cluster management.
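
A minimal sketch of the relevant settings, assuming the Hadoop 2.x variable names used in yarn-env.sh (values are in MB and are illustrative only):

    # yarn-env.sh
    export YARN_RESOURCEMANAGER_HEAPSIZE=2048
    export YARN_NODEMANAGER_HEAPSIZE=1024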

Resolution Steps:

  • Adjust the heap space property until the Resource Manager resumes running.

Information to Collect:

  • yarn-env.sh from cluster.

  • Screenshot of the MapReduce settings screen in Ambari cluster management.

  • Resource Manager log.

  • Node Manager log.

Potential Root Cause:  Permissions Not Set Correctly on Local File System

Troubleshooting Steps:

  • Examine the permissions on the various directories on the local file system.

  • Verify proper ownership (yarn/mapred for MapReduce directories and hdfs for HDFS directories).

Resolution Steps:

  • Use the chmod command to change the permissions of the directories to 755 (see the example after this list).

  • Use the chown command to assign the directories to the correct owner (hdfs or yarn/mapred).

  • Relaunch the Hadoop daemons using the correct user.
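
For example, assuming a Node Manager local directory of /hadoop/yarn/local and a hadoop group (both are assumptions; substitute the directories, owner, and group from your configuration):

    chmod 755 /hadoop/yarn/local
    chown -R yarn:hadoop /hadoop/yarn/local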

Information to Collect:

  • core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml

  • Permissions listing for the directories listed in the above configuration files.

Potential Root Cause:  Insufficient Disk Space

Troubleshooting Steps:

  • Verify that there is sufficient space on all system, log, and HDFS partitions.

  • Run the df -k command on the NameNode and DataNodes to verify that there is sufficient capacity on the disk volumes used for storing NameNode metadata or HDFS data.

Resolution Steps:

  • Free up disk space on all nodes in the cluster (see the example below).

    -OR-

  • Add additional capacity.
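
One possibility for reclaiming HDFS space, assuming trash is enabled on the cluster, is to empty the HDFS trash:

    hdfs dfs -expunge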

Information to Collect:

  • Core dumps.

  • Linux command: last (history).

  • Dump of file system information.

  • Output of df -k command.

Potential Root Cause:  Reserved Disk Space is Set Higher than Free Space

Troubleshooting Steps:

  • In hdfs-site.xml, check that the value of the dfs.datanode.du.reserved property is less than the free space available on the DataNode volumes (see the example below).
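
For reference, the property is set per DataNode volume in hdfs-site.xml, with the value in bytes (the 10 GB figure below is illustrative only):

    <property>
      <name>dfs.datanode.du.reserved</name>
      <value>10737418240</value>
    </property>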

Resolution Steps:

  • Configure an appropriate value, or increase free space.

Information to Collect:

  • HDFS configuration files.

Node Java Process Exited Abnormally

Potential Root Cause:  Improper Shutdown

Troubleshooting Steps:

  • Investigate the OS history and the Hadoop audit logs.

  • Verify that no edit log or fsimage corruption occurred (see the example after this list).
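
The offline viewers can help confirm that the edit log and fsimage are readable (the input file names below are placeholders; use the files under your NameNode's current directory):

    hdfs oev -i edits_inprogress_0000000000000000001 -o edits.xml
    hdfs oiv -i fsimage_0000000000000000000 -o fsimage.txt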

Resolution Steps:

  • Investigate the cause, and take measures to prevent future occurrence.

Information to Collect:

  • Hadoop audit logs.

  • Linux command: last (history).

  • Linux user command history.

Potential Root Cause:  Incorrect Memory Configuration

Troubleshooting Steps:

  • Verify values in configuration files.

  • Check the logs for stack traces, such as out-of-heap-space errors (see the example after this list).
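
A quick way to find such traces, assuming the default HDP log location (adjust the path for your installation):

    grep -i "java.lang.OutOfMemoryError" /var/log/hadoop-yarn/yarn/*.log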

Resolution Steps:

  • Fix configuration values and restart job/Resource Manager/Node Manager.

Information to Collect:

  • Resource Manager log.

  • Node Manager log.

  • MapReduce v2 configuration files.

Node Manager Denied Communication with Resource Manager

Potential Root Cause:  Hostname Is in the Exclude File or Missing from the Include File

Troubleshooting Steps:

  • Verify the contents of the files referenced by the yarn.resourcemanager.nodes.exclude-path property or the yarn.resourcemanager.nodes.include-path property (see the example after this list).

  • Verify that the host for this Node Manager is not being decommissioned.
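
For example, to locate and inspect the configured exclude file (the configuration directory and file name below are assumptions; use the path reported by the grep):

    grep -A 1 yarn.resourcemanager.nodes.exclude-path /etc/hadoop/conf/yarn-site.xml
    cat /etc/hadoop/conf/yarn.exclude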

Resolution Steps:

  • If the hostname for the Node Manager is in the exclude file, and the node is not meant to be decommissioned, remove it and refresh the node list (see the example below).
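
After editing the exclude file, the Resource Manager can be told to re-read the node lists without a restart:

    yarn rmadmin -refreshNodes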

Information to Collect:

  • Files that are pointed to by the yarn.resourcemanager.nodes.exclude-path or yarn.resourcemanager.nodes.include-path properties.

Potential Root Cause:  Node was Decommissioned and/or Reinserted into the Cluster

Troubleshooting Steps:

  • This is a problem with HDFS. Refer to the HDFS Troubleshooting Guide.

Potential Root Cause:  Resource Manager is Refusing the Connection

Troubleshooting Steps:

  • Follow the steps (described previously) to ensure that the Resource Manager has started and is accepting requests (see the example below).
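
A simple liveness check is to ask the Resource Manager for its node list; if the command hangs or the connection is refused, the Resource Manager is not accepting requests:

    yarn node -list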

Resolution Steps:

  • Ensure that the Node Manager host is not in the "exclude" list.

  • Ensure that the Node Manager host is in the "include" list.

Information to Collect:

  • Resource Manager slaves file.

  • The file specified by the yarn.resourcemanager.nodes.exclude-path property in the YARN configuration.

  • The file specified by the yarn.resourcemanager.nodes.include-path property in the YARN configuration.

  • YARN configuration files.

Potential Root Cause:  NameNode is Refusing the Connection

Troubleshooting Steps:

  • Follow the steps (described previously) to ensure that the NameNode has started and is accepting requests (see the example below).
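
A simple check is to request a report from the NameNode; if the command hangs or the connection is refused, the NameNode is not accepting requests:

    hdfs dfsadmin -report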

Resolution Steps:

  • Ensure that the DataNode is not in the "exclude" list.

Information to Collect:

  • NameNode slaves file.

  • NameNode hosts.deny file (or the file specified as the blacklist in HDFS configuration).

  • NameNode hosts.allow file (or the file specified as the whitelist in HDFS configuration).

  • HDFS Configuration.

Error: Could Only Be Replicated to x Nodes, Instead of n

Potential Root Cause:  At Least One DataNode is Nonfunctional

Troubleshooting Steps:

  • This is a problem with HDFS. Refer to the HDFS Troubleshooting Guide.

Potential Root Cause:  One or More DataNodes Are Out of Space on Their Currently Available Disk Drives

Troubleshooting Steps:

  • This is a problem with HDFS. Refer to the HDFS Troubleshooting Guide.