Troubleshooting Failed Jobs
This topic describes how to investigate and troubleshoot the root cause of jobs that failed to complete on your cluster. The steps include an example.
- In a supported browser, log in to Workload XM.
On the Clusters page, do one of the following:
- In the Search field, enter the name of the cluster whose workloads you want to analyze.
- From the Cluster Name column, locate and click on the name of the cluster whose workloads you want to analyze.
- From the navigation panel, select Jobs under Data Engineering.
From the Health Check list on the Jobs page, select Failed to Finish. The list is filtered to display only the jobs that did not complete.
To view more details about why a job failed to complete, click the job's name in the Job column, and then click the Health Checks tab.
The Baseline Health checks are displayed.
From the Health Checks panel, select the Failed to Finish health check.
In this example, the health check reveals that the failure occurred in the Map Stage of the job:
- To display more information about the Map Stage process, click Map Stage and then from the Map Stage panel, click Execution Details.
To see all the failed tasks, in the Summary panel, click the number in the Failed field.
To display the error message for a failed attempt, select its task. Repeat for each failed attempt.
In this example, the error message "Task KILL is received. Killing attempt!" does not explain why the KILL was received. To find the root cause, you need to understand what triggered the error. To investigate further, click Logs to open the associated log file.
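
If you save the log text to a local file, a short script can help locate the KILL message and the lines that precede it, which is often where the trigger appears. The following is a minimal sketch, not part of Workload XM. The function names and the number of context lines are illustrative assumptions, and it assumes the log is a plain-text file that contains the exact message shown above.

```python
from collections import deque

# The error message shown for the failed attempt in this example.
KILL_MESSAGE = "Task KILL is received. Killing attempt!"


def find_kill_context(lines, before=5):
    """Return (line_number, context) pairs for every KILL message found.

    context is a list holding up to `before` preceding lines followed by
    the matching line itself. `before` is an illustrative default.
    """
    recent = deque(maxlen=before)  # rolling window of the latest lines
    hits = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        if KILL_MESSAGE in text:
            hits.append((number, list(recent) + [text]))
        recent.append(text)
    return hits


def print_kill_context(path, before=5):
    """Print each KILL message in the file at `path` with its context.

    `path` is a hypothetical local copy of the log opened from Logs.
    """
    with open(path, encoding="utf-8", errors="replace") as handle:
        for number, context in find_kill_context(handle, before):
            print(f"--- KILL message at line {number} ---")
            print("\n".join(context))
```

For example, calling print_kill_context with the path of your saved log file prints each occurrence of the message together with the five lines before it, which you can then review for the event that caused the KILL.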