Troubleshooting Failed Jobs

This topic describes how to locate jobs on your cluster that failed to complete and how to troubleshoot them.

The steps below use a worked example to show how to investigate the root cause of an uncompleted job.

  1. In a supported browser, log in to Workload XM.
  2. On the Clusters page, do one of the following:
    • In the Search field, enter the name of the cluster whose workloads you want to analyze.
    • In the Cluster Name column, locate and click the name of the cluster whose workloads you want to analyze.
  3. From the navigation panel, select Jobs under Data Engineering.
  4. From the Health Check list on the Jobs page, select Failed to Finish. The list is filtered to display only the jobs that did not complete.

  5. To view more details about why the job failed to complete, from the Job column, select a job's name and then click the Health Checks tab.
    The Baseline Health checks are displayed.
  6. From the Health Checks panel, select the Failed to Finish health check.
    In this example, the health check shows that the failure occurred in the Map Stage of the job.

  7. To display more information about the Map Stage, click Map Stage and then, in the Map Stage panel, click Execution Details.
  8. To see all the failed tasks, in the Summary panel, click the number in the Failed field.

  9. For each failed attempt, display the error message by selecting each task.
    In this example, the error message "Task KILL is received. Killing attempt!" does not explain why the KILL was received. To find the root cause, you need to understand what triggered the error. To investigate further, click Logs to open the associated log file.
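
If you save the log text to a local file, a short script can help you find what was logged just before the KILL message. The following is a minimal sketch, not a Workload XM feature. It assumes the log is plain text with one entry per line and that the message text matches the example above. The function name context_before_kill, the before parameter, and the sample lines are hypothetical and used only for illustration.

```python
import re
from collections import deque

# Message text taken from the example above; adjust it if your log differs.
KILL_PATTERN = re.compile(r"Task KILL is received", re.IGNORECASE)


def context_before_kill(log_lines, before=5):
    """Collect the lines that precede each KILL message.

    log_lines: any iterable of strings (for example, an open text file).
    before:    how many preceding lines to keep as context.
    Returns a list of dicts with the keys "kill_line" and "context".
    """
    recent = deque(maxlen=before)  # rolling window of the most recent lines
    hits = []
    for raw in log_lines:
        line = raw.rstrip("\n")
        if KILL_PATTERN.search(line):
            hits.append({"kill_line": line, "context": list(recent)})
        recent.append(line)
    return hits


# Placeholder lines, invented for illustration; they are not real log output.
sample = [
    "placeholder entry 1",
    "placeholder entry 2",
    "Task KILL is received. Killing attempt!",
]
for hit in context_before_kill(sample, before=2):
    print(hit["kill_line"])
    for ctx in hit["context"]:
        print("  preceded by:", ctx)
```

To scan a saved log instead of the placeholder list, pass an open file object, for example open("task.log", encoding="utf-8"), where task.log is a hypothetical file name. The lines that precede the KILL message are only a starting point; the trigger may be recorded elsewhere in the log.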