3.2.11. Common Client-Side Issues

Job Fails to Start

Symptom: Exception When Job Submitted, Potential Root Cause:  Mistake in the Job's User Code

Troubleshooting Steps:

  • Examine Node Manager/Resource Manager logs and task-logs to find the exact exception.

Resolution Steps:

  • Examine the stack-trace for the thrown exception.

  • Examine the user code to see if you can spot the error.

Information to Collect:

  • Resource Manager log.

  • Node Manager log.

  • The exception trace reported by the user, and the task logs.

  • If possible, get at least a snippet of Java code from the area where the exception was thrown.

Symptom: "No Class Def Found" or Similar Exception When Trying to Start Job, Potential Root Cause - 1:  Job's .jar File -- or Other .jar File -- Not on Classpath

Troubleshooting Steps:

  • Verify that the exception is ClassNotFoundException, NoClassDefFoundError, NoSuchMethodError, or a similar exception.

Resolution Steps:

  • Find the .jar file that contains the missing class and add it to the classpath.
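
One way to find the .jar file that contains a missing class is to scan candidate .jar files for the matching class entry. The following is a minimal sketch using only the standard Java library; the class name JarClassFinder and its methods are hypothetical names chosen for illustration and are not part of Hadoop.

```java
import java.io.IOException;
import java.util.jar.JarFile;

// Hypothetical helper: reports which of the given .jar files contain a class.
public class JarClassFinder {

    // Converts a fully qualified class name into the entry path used inside a .jar file.
    public static String entryName(String className) {
        return className.replace('.', '/') + ".class";
    }

    // Returns true if the .jar file at jarPath contains an entry for className.
    public static boolean jarContains(String jarPath, String className) throws IOException {
        try (JarFile jar = new JarFile(jarPath)) {
            return jar.getEntry(entryName(className)) != null;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: JarClassFinder <class name> <jar> [<jar> ...]");
            return;
        }
        // args[0] is the missing class; the remaining arguments are candidate .jar files.
        for (int i = 1; i < args.length; i++) {
            if (jarContains(args[i], args[0])) {
                System.out.println(args[0] + " found in " + args[i]);
            }
        }
    }
}
```

Run it with the class named in the exception and the .jar files you suspect. Any .jar file it reports is a candidate to add to the classpath.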

Information to Collect:

  • The entire command used to submit the job.

  • The stack-trace from the Node Manager logs.

Potential Root Cause - 2:  Main Class or Method of the Job Code is not "Public Static"

Troubleshooting Steps:

  • Examine the code for the main MRv2 class.

Resolution Steps:

  • Declare the main class as "public" and its main method as "public static void main(String[] args)".

  • Recompile and re-test.
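
If the source code is not at hand, the modifiers of a compiled class can be inspected with reflection. The sketch below assumes the job's classes are on the classpath when it runs; MainMethodCheck and hasValidMain are hypothetical names used only for illustration.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;

// Hypothetical helper: checks whether a class declares a usable entry point.
public class MainMethodCheck {

    // Returns true if cls is public and declares "public static void main(String[])".
    public static boolean hasValidMain(Class<?> cls) {
        try {
            Method main = cls.getDeclaredMethod("main", String[].class);
            int mods = main.getModifiers();
            return Modifier.isPublic(cls.getModifiers())
                    && Modifier.isPublic(mods)
                    && Modifier.isStatic(mods)
                    && main.getReturnType() == void.class;
        } catch (NoSuchMethodException e) {
            // The class has no main(String[]) method at all.
            return false;
        }
    }

    public static void main(String[] args) throws ClassNotFoundException {
        if (args.length != 1) {
            System.err.println("usage: MainMethodCheck <fully qualified class name>");
            return;
        }
        Class<?> cls = Class.forName(args[0]);
        System.out.println(args[0] + (hasValidMain(cls)
                ? " has a public static main method"
                : " does NOT have a public static main method"));
    }
}
```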

Information to Collect:

  • The exact exception thrown by Hadoop.

  • The job source code.

Job Seems to Hang in Setup

Symptom: Job Seems to Hang and Node Manager Becomes "Blacklisted", Potential Root Cause:  Too Many Allowed Slots Configured for the System Memory on the Node

Troubleshooting Steps:

  • Verify the amount of system memory.

  • Calculate the required memory for each configured Container.

  • Take into account any other processes running on the node.

Resolution Steps:

  • Add the memory required by all configured Containers to the memory used by the other processes running on the node. If the total is greater than the system memory available on the node, reduce the amount of memory configured in the Container properties.
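
The arithmetic behind this check can be sketched as follows. All figures in the example (a 16 GB node, ten Containers of 2048 MB each, and 2048 MB for other processes) are made-up values for illustration; substitute the values from your own node and Container configuration. The class and method names are hypothetical.

```java
// Hypothetical helper: compares configured Container memory with system memory.
public class NodeMemoryCheck {

    // Total memory needed: all Containers plus everything else running on the node.
    public static long totalRequiredMb(int containers, long mbPerContainer, long otherProcessesMb) {
        return containers * mbPerContainer + otherProcessesMb;
    }

    // Returns true if the configured Containers fit within the node's system memory.
    public static boolean fits(long systemMb, int containers, long mbPerContainer, long otherProcessesMb) {
        return totalRequiredMb(containers, mbPerContainer, otherProcessesMb) <= systemMb;
    }

    public static void main(String[] args) {
        // Illustrative values only; replace with the figures for your node.
        long systemMb = 16384;
        int containers = 10;
        long mbPerContainer = 2048;
        long otherProcessesMb = 2048;

        long required = totalRequiredMb(containers, mbPerContainer, otherProcessesMb);
        System.out.println("Required: " + required + " MB, available: " + systemMb + " MB");
        if (!fits(systemMb, containers, mbPerContainer, otherProcessesMb)) {
            System.out.println("Node is over-committed; reduce the Container memory settings.");
        }
    }
}
```

With these sample values the node needs 22528 MB but has only 16384 MB, so the Container settings would have to be reduced. Six Containers of the same size (14336 MB in total with the other processes) would fit.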

Symptom: Job Seems to Hang Without "Blacklisting", Potential Root Cause:  No Node Managers Currently Available

Troubleshooting Steps:

  • Verify the number of Node Managers available to run MRv2 tasks by looking at:

    <Resource Manager host>:8088/cluster/nodes

Resolution Steps:

  • Wait until more Node Managers become available, then see if the job runs.
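
The Node Manager count can also be checked from a script rather than the web page. The sketch below assumes the Resource Manager exposes its REST API at /ws/v1/cluster/nodes on the same port (8088) and returns JSON with a "state" field per node; verify both assumptions against your Hadoop version. It uses a simple text match rather than a JSON parser, and the class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: counts Node Managers that the Resource Manager reports as RUNNING.
public class NodeManagerCount {

    private static final Pattern RUNNING =
            Pattern.compile("\"state\"\\s*:\\s*\"RUNNING\"");

    // Counts occurrences of "state":"RUNNING" in the JSON text (text match, not a full parse).
    public static int countRunning(String json) {
        Matcher m = RUNNING.matcher(json);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // Fetches the node list as JSON text. The REST path is an assumption; see the note above.
    public static String fetchNodes(String rmHost) throws IOException {
        URI uri = URI.create("http://" + rmHost + ":8088/ws/v1/cluster/nodes");
        HttpURLConnection conn = (HttpURLConnection) uri.toURL().openConnection();
        conn.setRequestProperty("Accept", "application/json");
        try (InputStream in = conn.getInputStream();
             Scanner s = new Scanner(in, "UTF-8").useDelimiter("\\A")) {
            return s.hasNext() ? s.next() : "";
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: NodeManagerCount <Resource Manager host>");
            return;
        }
        System.out.println("Running Node Managers: " + countRunning(fetchNodes(args[0])));
    }
}
```

A count of zero is consistent with the root cause described here: the job is waiting because no Node Manager is available to run it.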

Information to Collect:

  • None until the job actually fails to run. If it does fail, troubleshoot based on the failure symptom.