Troubleshooting Failed Jobs

Steps for troubleshooting incomplete jobs running on your cluster.

Describes how to locate and troubleshoot jobs that have failed to complete.

The following procedure uses examples from a Spark engine to show how to investigate and troubleshoot the root cause of a job that did not complete.

  1. Verify that you are logged in to the Workload XM web UI.
    1. In the URL field of a supported web browser, enter the Workload XM URL that you were given by your system administrator and press Enter.
    2. When the Workload XM Log in page opens, enter your Workload XM user name and password.
    3. Click Log in.
  2. On the Clusters page, do one of the following:
    • In the Search field, enter the name of the cluster whose workloads you want to analyze.
    • From the Cluster Name column, locate and click on the name of the cluster whose workloads you want to analyze.
  3. From the time-range list in the Cluster Summary page, select a time period that meets your requirements.
  4. In the Usage Analysis chart widget, note which engines are displaying Failed jobs. Then, in the Trend widget, select the tab of an engine whose failed jobs you want to analyze and click its Total Jobs value.
    The engine's Jobs page opens.
  5. From the Health Check list, select Failed to Finish, which filters the list to display only the jobs that did not complete.
  6. To view more details about why a job failed to complete, select the job's name from the Job column. The job's page opens, displaying information about the selected job and where the failure occurred.

  7. From the Failures section, in the Diagnostic Information column, click More.
    The Diagnostic Information dialog box opens and gives more detail about why the job was aborted. In this example, the job was aborted while writing rows because of an out-of-bounds Java exception.

  8. Click Close to close the dialog box.
  9. To display more information about the stage where the job failed, in this case Stage-2, click the stage's link in the Failing from column. Alternatively, select the Execution Details tab and then click the failed stage's link.
    In this example, the Summary panel shows that Task 0 was attempted 4 times.

  10. To display more information about all the failed attempts, in the Summary panel, click the Failed task value.
    In this example, the job was aborted while Task 0 was writing rows. To understand what triggered the SparkException error message and to troubleshoot the root cause further, open the associated log file by clicking Full error log.
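
Once you have the full error log, the procedure's last step can be sketched in a few lines of Python. This is a minimal illustration and not part of Workload XM. It assumes the log contains Spark's usual stage-failure summary ("Task N in stage S.A failed K times") followed by a Java stack trace with "Caused by:" lines. The function name `summarize_failure`, the sample log text, the host name, and the specific exception class are all hypothetical; the exact wording of your log may vary between Spark versions.

```python
import re
from typing import Optional

# Summary line that Spark typically writes when a job is aborted after a
# task exhausts its retries (assumed format, check against your own log).
STAGE_FAILURE = re.compile(
    r"Task (?P<task>\d+) in stage (?P<stage>\d+)\.\d+ failed (?P<attempts>\d+) times"
)

# Fully qualified Java exception or error class with an optional message.
JAVA_EXCEPTION = re.compile(
    r"(?P<cls>(?:[a-z_]\w*\.)+[A-Z]\w*(?:Exception|Error))(?::\s*(?P<msg>.*))?"
)

# Made-up log excerpt for illustration only; not taken from a real cluster.
SAMPLE_LOG = """\
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 7, worker-1.example.com, executor 1): org.apache.spark.SparkException: Task failed while writing rows.
\tat org.example.Placeholder.frame(Placeholder.scala:1)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 12
\tat org.example.Placeholder.frame(Placeholder.scala:2)
"""


def summarize_failure(log_text: str) -> Optional[dict]:
    """Return the failed task, stage, attempt count, and deepest cause."""
    stage = STAGE_FAILURE.search(log_text)
    if stage is None:
        return None  # no stage-failure summary found in this log

    # In a Java stack trace, the last "Caused by:" line is the deepest cause.
    cause = None
    for line in log_text.splitlines():
        if line.startswith("Caused by:"):
            found = JAVA_EXCEPTION.search(line)
            if found:
                cause = found

    return {
        "task": int(stage["task"]),
        "stage": int(stage["stage"]),
        "attempts": int(stage["attempts"]),
        "root_cause": cause["cls"] if cause else None,
        "message": (cause["msg"] or "").strip() if cause else None,
    }


if __name__ == "__main__":
    print(summarize_failure(SAMPLE_LOG))
```

To use the sketch on a real log, read the file you opened with Full error log into a string and pass it to `summarize_failure`. The last "Caused by:" entry is reported because it is usually the exception closest to the original fault, which is where root-cause analysis normally starts.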