3.2.11. Common Client-Side Issues - Hortonworks Data Platform

Symptom: Exception When Job Submitted, Potential Root Cause: Mistake in the Job's User Code

Troubleshooting Steps:

Examine Node Manager/Resource Manager logs and task-logs to find the exact exception.

Resolution Steps:

Examine the stack-trace for the thrown exception.
Examine the user code to see if you can spot the error.
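
When the task log wraps the original failure in several layers, the innermost cause and the first stack frame inside the job's own package usually point at the faulty line. The following is a minimal sketch of that reading strategy, assuming the job code lives under a known package prefix. The class RootCauseFinder, its method names, and the prefix passed in are hypothetical and used only for illustration.

```java
import java.util.Optional;

class RootCauseFinder {

    // Follow the getCause() links down to the innermost exception.
    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        while (current.getCause() != null) {
            current = current.getCause();
        }
        return current;
    }

    // Return the first stack frame whose class name starts with the given
    // prefix (assumed here to identify the job's own code).
    static Optional<StackTraceElement> firstUserFrame(Throwable t, String userPrefix) {
        for (StackTraceElement frame : t.getStackTrace()) {
            if (frame.getClassName().startsWith(userPrefix)) {
                return Optional.of(frame);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        try {
            try {
                Integer.parseInt("not-a-number"); // simulated mistake in user code
            } catch (NumberFormatException e) {
                throw new RuntimeException("task failed", e); // framework-style wrapping
            }
        } catch (RuntimeException wrapped) {
            Throwable root = rootCause(wrapped);
            System.out.println("Root cause: " + root);
            firstUserFrame(root, "RootCauseFinder")
                .ifPresent(f -> System.out.println("First user frame: " + f));
        }
    }
}
```

In a real investigation the same two questions apply to the trace copied from the task log: which exception is at the bottom of the "Caused by" chain, and which frame is the first one in the job's own package.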

Information to Collect:

Resource Manager log.
Node Manager log.
The exception trace that the user reported, and the task logs.
If possible, get at least a snippet of Java code from the area where the exception was thrown.

Symptom: "NoClassDefFoundError" or Similar Exception When Trying to Start Job, Potential Root Cause - 1: Job's .jar File -- or Other .jar File -- Not on Classpath

Troubleshooting Steps:

Verify that the exception is ClassNotFoundException, NoClassDefFoundError, NoSuchMethodError, or a similar exception.

Resolution Steps:

Find the .jar file that contains the missing class and add it to the classpath.
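
One way to locate the .jar file is to scan a directory of candidate .jar files for the class named in the exception. The sketch below assumes the candidates sit directly in one directory. The class JarClassFinder and its methods are hypothetical helpers, not part of Hadoop.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.jar.JarFile;

class JarClassFinder {

    // Convert "com.example.Foo" to the jar entry name "com/example/Foo.class".
    // Names that already use slashes (as NoClassDefFoundError prints them) pass through unchanged.
    static String toEntryName(String className) {
        return className.replace('.', '/') + ".class";
    }

    // Return every .jar file directly under dir that contains the class.
    static List<Path> findJars(Path dir, String className) throws IOException {
        String entry = toEntryName(className);
        List<Path> hits = new ArrayList<>();
        try (DirectoryStream<Path> jars = Files.newDirectoryStream(dir, "*.jar")) {
            for (Path jar : jars) {
                try (JarFile jarFile = new JarFile(jar.toFile())) {
                    if (jarFile.getJarEntry(entry) != null) {
                        hits.add(jar);
                    }
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: JarClassFinder <directory> <class name>");
            return;
        }
        for (Path jar : findJars(Paths.get(args[0]), args[1])) {
            System.out.println(jar);
        }
    }
}
```

If the scan finds no match, the class is probably packaged elsewhere, or the name in the stack trace was copied incorrectly.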

Information to Collect:

The entire command used to submit the job.
The stack-trace from the Node Manager logs.

Potential Root Cause - 2: Main Class or Method of the Job Code is not "Public Static"

Troubleshooting Steps:

Examine the code for the main MRv2 class.

Resolution Steps:

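Declare the main class and its main method with the required modifiers: the class must be "public" (and "static" if it is a nested class), and the main method must be "public static".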
Recompile and re-test.
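
The modifier check can also be done by reflection before resubmitting the job. The sketch below is illustrative only: MainSignatureCheck, GoodDriver, and BadDriver are hypothetical names, and the check covers only the class's "public" modifier and the "public static" modifiers of a main(String[]) method declared in that class.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;

class MainSignatureCheck {

    // Return a description of the first problem found, or null if the
    // class and its main(String[]) method carry the expected modifiers.
    static String problemWith(Class<?> cls) {
        if (!Modifier.isPublic(cls.getModifiers())) {
            return cls.getName() + " is not public";
        }
        Method main;
        try {
            main = cls.getDeclaredMethod("main", String[].class);
        } catch (NoSuchMethodException e) {
            return cls.getName() + " declares no main(String[]) method";
        }
        int mods = main.getModifiers();
        if (!Modifier.isPublic(mods)) {
            return "main method is not public";
        }
        if (!Modifier.isStatic(mods)) {
            return "main method is not static";
        }
        return null;
    }

    // Hypothetical driver with the correct modifiers.
    public static class GoodDriver {
        public static void main(String[] args) { }
    }

    // Hypothetical driver whose main method is neither public nor static.
    public static class BadDriver {
        void main(String[] args) { }
    }

    public static void main(String[] args) {
        System.out.println("GoodDriver: " + problemWith(GoodDriver.class));
        System.out.println("BadDriver: " + problemWith(BadDriver.class));
    }
}
```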

Information to Collect:

The exact exception thrown by Hadoop.
The job source code.

Symptom: Job Seems to Hang and Node Manager Becomes "Blacklisted", Potential Root Cause: Too Many Containers Configured for the System Memory on the Node

Troubleshooting Steps:

Verify the amount of system memory.
Calculate the required memory for each configured Container.
Take into account any other processes running on the node.

Resolution Steps:

Add the memory required by all configured Containers to the memory used by the other processes on the node. If the total is greater than the system memory available on the node, reduce the memory configured in the Container properties.
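
The memory comparison is plain arithmetic, sketched below with made-up figures. The class NodeMemoryCheck, its methods, and every number in main are assumptions for illustration. Substitute the values measured on the node.

```java
class NodeMemoryCheck {

    // Memory demanded on the node: all containers plus other processes.
    static long requiredMb(int containers, long mbPerContainer, long otherProcessesMb) {
        return containers * mbPerContainer + otherProcessesMb;
    }

    // Largest number of containers that fit once other processes are accounted for.
    static long maxContainers(long nodeMb, long mbPerContainer, long otherProcessesMb) {
        long usableMb = nodeMb - otherProcessesMb;
        return usableMb <= 0 ? 0 : usableMb / mbPerContainer;
    }

    public static void main(String[] args) {
        // Hypothetical node: 16 GB of RAM, 10 containers of 2 GB each,
        // and 4 GB used by the operating system and other daemons.
        long nodeMb = 16384;
        int containers = 10;
        long mbPerContainer = 2048;
        long otherProcessesMb = 4096;

        long required = requiredMb(containers, mbPerContainer, otherProcessesMb);
        System.out.println("Required: " + required + " MB, available: " + nodeMb + " MB");
        if (required > nodeMb) {
            System.out.println("Oversubscribed; containers that fit: "
                + maxContainers(nodeMb, mbPerContainer, otherProcessesMb));
        }
    }
}
```

With these example figures the node needs 24576 MB but has only 16384 MB, so at most 6 containers of that size fit alongside the other processes.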

Symptom: Job Seems to Hang Without "Blacklisting", Potential Root Cause: No Node Managers Currently Available

Troubleshooting Steps:

Verify the number of Node Managers currently available by looking at:
<Resource Manager host>:8088/cluster/nodes

Resolution Steps:

Wait until more Node Managers become available, then see if the job runs.
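
The availability check can also be scripted instead of read from the web page. The sketch below assumes an unsecured cluster whose Resource Manager serves the YARN REST endpoint /ws/v1/cluster/nodes on port 8088 and reports each node with a "state" field. The class RunningNodeCounter, the default host name, and the regular-expression counting (used here in place of a JSON parser) are assumptions made to keep the example self-contained.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RunningNodeCounter {

    // Matches one node entry whose state is RUNNING in the JSON response.
    private static final Pattern RUNNING =
        Pattern.compile("\"state\"\\s*:\\s*\"RUNNING\"");

    // Count the nodes reported as RUNNING in the response body.
    static int countRunning(String json) {
        Matcher matcher = RUNNING.matcher(json);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    // Fetch the response body from the given URL as a UTF-8 string.
    static String fetch(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setRequestProperty("Accept", "application/json");
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(5000);
        try (InputStream in = conn.getInputStream()) {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int read;
            while ((read = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, read);
            }
            return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // The host defaults to localhost only for illustration.
        String rmHost = args.length > 0 ? args[0] : "localhost";
        String json = fetch("http://" + rmHost + ":8088/ws/v1/cluster/nodes");
        System.out.println("RUNNING Node Managers: " + countRunning(json));
    }
}
```

A count of zero is consistent with the root cause described here: the job has nowhere to run until at least one Node Manager returns to the RUNNING state.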

Information to Collect:

None until the job actually fails to run. At that point, troubleshoot based on the failure symptom.