Troubleshooting Abnormal Job Durations

Identify areas of risk from jobs with unusual durations running on your cluster.

Describes how to locate jobs with abnormal durations and includes steps, with an example, that explain how to investigate and troubleshoot the cause of an abnormal job duration.

  1. In a supported browser, log in to Workload XM.
  2. In the Clusters page, do one of the following:
    • In the Search field, enter the name of the cluster whose workloads you want to analyze.
    • From the Cluster Name column, locate and click on the name of the cluster whose workloads you want to analyze.
  3. From the time-range list in the Cluster Summary page, select a time period that meets your requirements.
  4. To display the jobs with an abnormal duration that executed within the selected time period, click the Abnormal Duration health check bar in the Suboptimal Jobs graph.


    The Job page opens, listing all the jobs that have triggered the Abnormal Duration health check.
  5. To filter the list to jobs that ran for less than or more than a specific amount of time, from the Duration list, either select a predefined duration or select Customize and enter the minimum or maximum time period.


  6. To view more details about a job, from the Job column, select a job's name and then click the Health Checks tab.
    The Baseline Health checks are displayed.
  7. To display more information about the job's duration, from the Baseline column, select Duration.
    The following reveals that, in this example, the job took much longer to finish than the baseline:


  8. To display more information about how long the processing tasks took within the job, from the Baseline column, select Task Duration.
    The following reveals that, in this example, a particular task took an abnormally long time to finish:


  9. To display more information about the abnormal task, click it, which opens the Task Details panel.
    The following reveals that, in this example, garbage collection for Task 160 is taking significantly more time than for the average task:


  10. To display more information about this job's garbage collection, from the Baseline column, select Task GC Time.
  11. In the Task GC Time page, click the Execution Details tab and then click one of the MapReduce stages:


  12. In the Summary panel, click View Configurations, and then locate the garbage collection configuration by entering part of the MapReduce memory configuration property name in the Search field:


    The garbage collection configuration reveals that the setting is 1024, which might be leaving the mapper JVM with insufficient memory and triggering too many garbage collections. Increasing this value should improve cluster performance and remove this task as a potential risk. For one way to apply the change, see the sketch after this procedure.
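
The following is a minimal sketch, for reference only, of one way a mapper memory setting like this could be raised when a MapReduce job is configured in Java. It assumes the property located in the Search field is the standard mapreduce.map.memory.mb container setting and that roughly 80% of the container is given to the JVM heap through mapreduce.map.java.opts; the exact property name, the 2048 MB target, and the job name are illustrative assumptions, not values taken from this example.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class RaiseMapperMemorySketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Assumption: the 1024 value found above is the mapper container
            // memory in MB; raise it to an illustrative 2048 MB.
            conf.setInt("mapreduce.map.memory.mb", 2048);

            // Keep the mapper JVM heap below the container size (about 80%),
            // and log GC activity so the next run can confirm fewer collections.
            conf.set("mapreduce.map.java.opts", "-Xmx1638m -verbose:gc");

            Job job = Job.getInstance(conf, "abnormal-duration-rerun");
            // Set the mapper, reducer, and input/output paths as usual, then:
            // System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The same properties can also be set cluster-wide in mapred-site.xml or per run on the command line. Whichever route you choose, rerun the job and recheck the Duration and Task GC Time health checks in Workload XM to confirm that the abnormal duration is resolved.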