Troubleshooting Failed Jobs

Describes how to locate jobs on your cluster that failed to complete and how to troubleshoot them.

The following steps, with examples, show how to investigate the root cause of a job that did not complete.

  1. In a supported browser, log in to Workload XM.
  2. On the Clusters page, do one of the following:
    • In the Search field, enter the name of the cluster whose workloads you want to analyze.
    • From the Cluster Name column, locate and click the name of the cluster whose workloads you want to analyze.
  3. From the time-range list on the Cluster Summary page, select a time period that meets your requirements.
  4. From the Trend widget, select the tab of an engine whose jobs you want to analyze and then click its Total Jobs value.
    The engine's Jobs page opens.
  5. From the Health Check list, select Failed to Finish. The list is filtered to display only the jobs that did not complete.


  6. To view more details about why a job failed to complete, from the Job column, select a job's name and then click the Health Checks tab.
    The Baseline Health checks are displayed.
  7. From the Health Checks panel, select the Failed to Finish health check.
    For example, as shown in the following image, the failure occurred in the Map Stage of the job process:


  8. To display more information about the Map Stage process, click Map Stage and then, from the Map Stage panel, click Execution Details.
  9. To display all the failed tasks, in the Summary panel, click the Failed value:


  10. To display the reason for a task's failure, select and expand its error message.
    For example, as shown in the following image, the task did not complete because it was stopped. To understand what triggered the "Task KILL is received. Killing attempt!" error message and to further troubleshoot the root cause, click Logs to open the associated log file.
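
After the log file is open, it can help to see what was logged immediately before the kill message. The following Python sketch is one possible way to do that. It assumes that you have saved the log contents to a local plain-text file. The function names, the number of context lines, and the file handling are illustrative assumptions and are not part of Workload XM.

```python
from typing import List

# Error message text as it appears in the example above.
KILL_MESSAGE = "Task KILL is received. Killing attempt!"


def read_log_lines(path: str) -> List[str]:
    """Read a locally saved log file (hypothetical path) into a list of lines."""
    with open(path, encoding="utf-8", errors="replace") as fh:
        return fh.read().splitlines()


def find_with_context(
    lines: List[str], needle: str = KILL_MESSAGE, before: int = 5
) -> List[List[str]]:
    """For each line containing `needle`, return that line preceded by
    up to `before` earlier lines, so the events leading up to it are visible."""
    hits = []
    for i, line in enumerate(lines):
        if needle in line:
            start = max(0, i - before)
            hits.append(lines[start : i + 1])
    return hits
```

For example, `find_with_context(read_log_lines("task.log"))` returns one block of lines per occurrence of the message, where `task.log` is a placeholder for wherever you saved the log. The lines that precede the message are a reasonable place to start looking for what requested the kill.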