Troubleshooting an Abnormal Job Duration

Identify areas of risk from jobs running on your cluster that complete within an unusual time period.

Describes how to locate and troubleshoot an abnormal job duration.

The following procedure uses examples from a Spark engine to explain how to further investigate and troubleshoot the cause of an abnormal job duration.

  1. Verify that you are logged in to the Workload XM web UI.
    1. In the URL field of a supported web browser, enter the Workload XM URL that you were given by your system administrator and press Enter.
    2. When the Workload XM Log in page opens, enter your Workload XM user name and password access credentials.
    3. Click Log in.
  2. In the Clusters page do one of the following:
    • In the Search field, enter the name of the cluster whose workloads you want to analyze.
    • From the Cluster Name column, locate and click on the name of the cluster whose workloads you want to analyze.
  3. From the time-range list in the Cluster Summary page, select a time period that meets your requirements.
  4. In the Usage Analysis chart, click the engine whose Failed column displays the number of jobs that did not complete.
  5. Depending on the engine you selected, in the engine's page that opens scroll down to either the Suboptimal Jobs or the Suboptimal Queries chart widget and click the Abnormal Duration health check bar.
    The Jobs or Queries page opens, listing all the jobs or queries that have triggered the Abnormal Duration Health check.

  6. Specify a specific amount of time in which the job either ran less than or more than the Health check rule by either selecting a predefined time duration or selecting Customize and enter the minimum or maximum time period from the Duration list.

  7. View more details about a job by selecting a job's name from the Job column and then clicking the Health Checks tab.
    The Baseline Health checks are displayed.
  8. Display more information about the job's duration by selecting Duration from the Baseline section. As shown in the image below.
    In the following example, the job finished much slower than the baseline duration, which is the aggregate calculated over multiple jobs.

  9. Compare and analyze this job against other baseline metrics by clicking View all metrics.
  10. Continue to analyze and search for probable causes by doing one or more of the following:
    • To display more information about the length of time the processing tasks took within a job, select Task Duration, which opens a panel that describes the health check, displays information about the possible causes, and lists recommended solutions.
      In the following example, issues arose during Stage-9 of the job due to poor parallelization. The Recommendation section lists items for you to complete that may resolve the problem and the specific outlier tasks that produced the unusual results:

    • To display more details about an outlier, click the outlier task, which opens the Task Details panel.
      In the following example, the Task Details show that the outlier task took significantly more time to complete compared to previous runs. In this case, 41 minutes as compared to 8 minutes:

    • To gain more insights about the task's duration, such as checking memory allocation, click the Execution Details tab and then in the Summary panel, click Configurations:

    • In the Configurations panel, click the Spark Properties tab and search for the memory configuration settings and their values. If memory is less than the recommended value, increasing its value will improve cluster performance: