Job Fails to Start
- Symptom: Exception When Job Submitted, Potential Root Cause: Mistake in the Job's User Code
Troubleshooting Steps:
Examine the Node Manager and Resource Manager logs, and the task logs, to find the exact exception.
Resolution Steps:
Examine the stack-trace for the thrown exception.
Examine the user code to see if you can spot the error.
Information to Collect:
Resource Manager log.
Node Manager log.
The exception trace that the user has mentioned, and the task logs.
If possible, get at least a snippet of Java code from the area where the exception was thrown.
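When a task log contains a long, nested stack trace, the innermost "Caused by:" entry is often the quickest pointer to the faulty user code. The following is a minimal sketch of that idea, not a Hadoop utility: the class name StackTraceSummary and its heuristic (take the last "Caused by:" line, otherwise the first line that mentions an Exception or Error) are assumptions made for illustration.

```java
import java.util.List;

// Hypothetical helper, not part of Hadoop: summarizes a stack trace found in a log.
final class StackTraceSummary {

    /**
     * Returns the text of the last "Caused by:" line, or, if there is none,
     * the first non-frame line that mentions an Exception or Error.
     * Returns null when nothing in the log looks like an exception.
     */
    static String rootCause(List<String> logLines) {
        String firstException = null;
        String lastCausedBy = null;
        for (String raw : logLines) {
            String line = raw.trim();
            if (line.startsWith("Caused by:")) {
                lastCausedBy = line.substring("Caused by:".length()).trim();
            } else if (firstException == null
                    && !line.startsWith("at ") // skip stack frames
                    && (line.contains("Exception") || line.contains("Error"))) {
                firstException = line;
            }
        }
        return lastCausedBy != null ? lastCausedBy : firstException;
    }
}
```

The result is only a starting point; the full trace should still be collected as described above.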
- Symptom: "NoClassDefFoundError" or Similar Exception When Trying to Start Job, Potential Root Cause - 1: Job's .jar File (or Another Required .jar File) Not on Classpath
Troubleshooting Steps:
Verify that the exception is ClassNotFoundException, NoClassDefFoundError, NoSuchMethodError, or a similar exception.
Resolution Steps:
Find the .jar file that contains the missing class and add it to the classpath.
Information to Collect:
The entire command used to submit the job.
The stack-trace from the Node Manager logs.
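Finding the .jar file that contains the missing class can be automated. The sketch below uses only the standard java.util.jar API to scan one directory of .jar files for a given class; the class name JarClassFinder and both method names are hypothetical, and the directory to scan is whatever location holds your candidate libraries.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.jar.JarFile;

// Hypothetical helper, not part of Hadoop: locates the jar that provides a class.
final class JarClassFinder {

    /** Converts "com.example.Foo" to the jar entry name "com/example/Foo.class". */
    static String entryNameFor(String className) {
        return className.replace('.', '/') + ".class";
    }

    /** Returns the .jar files directly under dir that contain the given class. */
    static List<File> findJarsContaining(File dir, String className) throws IOException {
        String entry = entryNameFor(className);
        List<File> matches = new ArrayList<>();
        File[] candidates = dir.listFiles((d, name) -> name.endsWith(".jar"));
        if (candidates == null) {
            return matches; // dir is not a directory or cannot be read
        }
        for (File candidate : candidates) {
            try (JarFile jar = new JarFile(candidate)) {
                if (jar.getJarEntry(entry) != null) {
                    matches.add(candidate);
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java JarClassFinder <directory> <fully.qualified.ClassName>
        for (File jar : findJarsContaining(new File(args[0]), args[1])) {
            System.out.println(jar.getAbsolutePath());
        }
    }
}
```

Any .jar file printed by this sketch is a candidate to add to the classpath; if nothing is printed, the class lives elsewhere or the name taken from the stack trace is misspelled.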
- Potential Root Cause - 2: Main Class of the Job Code is not "Public", or its Main Method is not "Public Static"
Troubleshooting Steps:
Examine the code for the main MRv2 class.
Resolution Steps:
Declare the main class "public" and its main method "public static void main(String[] args)".
Recompile and re-test.
Information to Collect:
The exact exception thrown by Hadoop.
The job source code.
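The modifier check described above can also be done by reflection on the compiled class, which helps when the source is not immediately at hand. This is a minimal sketch assuming the job is launched through a conventional main entry point; MainMethodCheck, GoodDriver, and BadDriver are hypothetical names used only for illustration.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;

// Hypothetical helper, not part of Hadoop: verifies the entry point's modifiers.
final class MainMethodCheck {

    /** True if cls is public and exposes public static void main(String[]). */
    static boolean hasLaunchableMain(Class<?> cls) {
        if (!Modifier.isPublic(cls.getModifiers())) {
            return false;
        }
        try {
            // getMethod only returns public methods, so a non-public main is not found.
            Method main = cls.getMethod("main", String[].class);
            return Modifier.isStatic(main.getModifiers())
                    && main.getReturnType() == void.class;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    // Correct declaration: public class, public static void main.
    public static class GoodDriver {
        public static void main(String[] args) { }
    }

    // Incorrect declaration: main is neither public nor static.
    public static class BadDriver {
        void main(String[] args) { }
    }
}
```

Here hasLaunchableMain(MainMethodCheck.GoodDriver.class) returns true and the same call on BadDriver returns false.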
Job Seems to Hang in Setup
- Symptom: Job Seems to Hang and Node Manager Becomes "Blacklisted", Potential Root Cause: Too Many Containers Configured for the System Memory on the Node
Troubleshooting Steps:
Verify the amount of system memory.
Calculate the required memory for each configured Container.
Take into account any other processes running on the node.
Resolution Steps:
Add the memory required by all configured Containers to the memory used by the other processes on the node. If the total is greater than the system memory available on the node, reduce the amount configured in the Container properties.
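The memory check above is simple arithmetic, sketched here so the numbers can be tried quickly. The class NodeMemoryBudget, its method names, and the figures in the usage note are illustrative assumptions; they are not Hadoop properties or recommended values.

```java
// Hypothetical helper, not part of Hadoop: checks a node's memory budget.
final class NodeMemoryBudget {

    /** Total memory (MB) needed if every configured Container runs at once. */
    static long requiredMb(int containers, long mbPerContainer, long otherProcessesMb) {
        return (long) containers * mbPerContainer + otherProcessesMb;
    }

    /** True if the configured Containers plus other processes exceed the node's memory. */
    static boolean isOversubscribed(long nodeMb, int containers,
                                    long mbPerContainer, long otherProcessesMb) {
        return requiredMb(containers, mbPerContainer, otherProcessesMb) > nodeMb;
    }

    /** Largest number of Containers that fits on the node; 0 if none fit. */
    static int maxContainers(long nodeMb, long mbPerContainer, long otherProcessesMb) {
        long free = nodeMb - otherProcessesMb;
        if (free <= 0 || mbPerContainer <= 0) {
            return 0;
        }
        return (int) (free / mbPerContainer);
    }
}
```

For example, on a node with 16384 MB, 8 Containers of 2048 MB plus 4096 MB for other processes require 20480 MB, so the node is oversubscribed; at most 6 such Containers fit.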
- Symptom: Job Seems to Hang Without "Blacklisting", Potential Root Cause: No Node Managers Currently Available
Troubleshooting Steps:
Verify the number of Node Managers available to run MRv2 tasks by looking at:
<Resource Manager host>:8088/cluster/nodes
Resolution Steps:
Wait until more Node Managers become available, then see if the job runs.
Information to Collect:
None until the job actually fails to run; then troubleshoot based on the failure symptom.
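The same check can be scripted against the Resource Manager's REST interface instead of the web page. The sketch below reads the activeNodes field from the cluster metrics resource (/ws/v1/cluster/metrics). It assumes the REST interface is served over plain HTTP on the same port 8088 as the web UI, with no authentication; the class name ActiveNodeCheck is hypothetical, and the regular-expression parsing is a shortcut that a real tool would replace with a JSON library.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Hadoop: reports how many Node Managers are active.
final class ActiveNodeCheck {

    private static final Pattern ACTIVE =
            Pattern.compile("\"activeNodes\"\\s*:\\s*(\\d+)");

    /** Extracts the activeNodes value from the metrics JSON; -1 if it is absent. */
    static int parseActiveNodes(String json) {
        Matcher m = ACTIVE.matcher(json);
        return m.find() ? Integer.parseInt(m.group(1)) : -1;
    }

    /** Fetches the cluster metrics JSON (assumes HTTP, port 8088, no authentication). */
    static String fetchMetrics(String rmHost) throws IOException {
        URL url = new URL("http://" + rmHost + ":8088/ws/v1/cluster/metrics");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("Accept", "application/json");
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(5000);
        try (InputStream in = conn.getInputStream();
             Scanner s = new Scanner(in, StandardCharsets.UTF_8.name()).useDelimiter("\\A")) {
            return s.hasNext() ? s.next() : "";
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java ActiveNodeCheck <Resource Manager host>
        int active = parseActiveNodes(fetchMetrics(args[0]));
        System.out.println("Active Node Managers: " + active);
    }
}
```

A result of 0 is consistent with the root cause described above; a result of -1 means the field was not found in the response, and the assumptions about the URL should be rechecked.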