Troubleshooting Failed Jobs
Steps for troubleshooting jobs on your cluster that did not complete.
The steps include examples that show how to investigate and troubleshoot the root cause of a job that failed to finish.
-
In a supported browser, log in to the Workload XM web UI by doing the following:
- In the web browser URL field, enter the Workload XM URL that you were given by your system administrator and press Enter.
- When the Workload XM Log in page opens, enter your Workload XM user name and password.
- Click Log in.
-
In the Clusters page, do one of the following:
- In the Search field, enter the name of the cluster whose workloads you want to analyze.
- From the Cluster Name column, locate and click the name of the cluster whose workloads you want to analyze.
-
From the time-range list in the Cluster Summary page, select a time period that meets your requirements.
-
From the Trend widget, select the tab of the engine whose jobs you want to analyze, and then click its Total Jobs value.
The engine's Jobs page opens.
-
From the Health Check list, select Failed to Finish, which filters the list to show only the jobs that did not complete.
-
To view more details about why a job failed to complete, from the Job column, select the job's name and then click the Health Checks tab.
The Baseline Health checks are displayed.
-
From the Health Checks panel, select the Failed to Finish health check.
For example, as shown in the following image, the failure occurred in the Map Stage of the job process:
- To display more information about the Map Stage process, click Map Stage and then from the Map Stage panel, click Execution Details.
-
To display all the failed tasks, in the Summary panel, click the Failed value.
-
To display the reason for a task's failure, select and expand its error message.
For example, as shown in the following image, the task was not completed because it was stopped. To understand what triggered the "Task KILL is received. Killing attempt!" error message and to further troubleshoot the root cause, open the associated log file by clicking Logs.
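Once the log file is open, it can help to look at the lines logged just before the kill message. The following Python sketch shows one possible way to do that, assuming the log has been saved locally as plain text. The function name `find_kill_context` and the sample lines are hypothetical and are not part of Workload XM; only the message text is taken from the example above.

```python
from collections import deque

# Message text taken from the example error message above.
KILL_MESSAGE = "Task KILL is received. Killing attempt!"


def find_kill_context(lines, before=5):
    """Yield each line containing KILL_MESSAGE, preceded by up to
    `before` lines that came just ahead of it in the log."""
    recent = deque(maxlen=before)  # rolling window of earlier lines
    for line in lines:
        line = line.rstrip("\n")
        if KILL_MESSAGE in line:
            yield [*recent, line]
        recent.append(line)


# Hypothetical sample input; real log lines will look different.
sample = [
    "placeholder log line 1",
    "placeholder log line 2",
    "placeholder log line 3",
    "placeholder prefix " + KILL_MESSAGE,
]
for block in find_kill_context(sample, before=2):
    print("\n".join(block))
```

To scan a saved log instead of the sample list, pass an open file object (for example, `open(path, encoding="utf-8")`, where `path` is wherever you saved the log) as the `lines` argument.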