Troubleshooting an Abnormal Job Duration

Identify areas of risk from jobs on your cluster that take an unusually long or short time to complete.

This topic describes how to locate jobs with an abnormal duration and includes steps with examples that show how to investigate and troubleshoot the cause.

In a supported browser, log in to the Workload XM web UI by doing the following:
1. In the web browser URL field, enter the Workload XM URL that you were given by your system administrator and press Enter.
2. When the Workload XM Log in page opens, enter your Workload XM user name and password.
3. Click Log in.
In the Clusters page, do one of the following:
- In the Search field, enter the name of the cluster whose workloads you want to analyze.
- From the Cluster Name column, locate and click on the name of the cluster whose workloads you want to analyze.
From the time-range list in the Cluster Summary page, select a time period that meets your requirements.
From the Engine column in the Usage Analysis chart, click an engine whose jobs you wish to analyze.
To display the jobs with an abnormal duration that ran within the selected time period, click the Abnormal Duration health check bar in the Suboptimal Jobs chart widget.
The Job page opens, listing all the jobs that triggered the Abnormal Duration health check.

Tip: Any job that falls outside its baseline is counted. Hover over a bar in the chart to view how many jobs triggered that health check.
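The idea of a job falling outside its baseline can be illustrated with a minimal sketch. This is not how Workload XM computes its baseline, which this topic does not describe. It assumes a simple rule: a run is abnormal when its duration differs from the mean of previous runs by more than a chosen number of standard deviations. The function name, the threshold, and the sample durations are all hypothetical.

```python
from statistics import mean, stdev


def is_abnormal_duration(history_s, current_s, threshold=2.0):
    """Return True when current_s lies more than `threshold` standard
    deviations away from the mean of the previous run durations.

    Assumption: this z-score rule is only an illustration; it is not the
    rule that Workload XM applies.
    """
    if len(history_s) < 2:
        # Not enough history to estimate a spread, so nothing is flagged.
        return False
    baseline = mean(history_s)
    spread = stdev(history_s)
    if spread == 0:
        # All previous runs took the same time; any difference is a deviation.
        return current_s != baseline
    return abs(current_s - baseline) / spread > threshold


# Hypothetical durations, in seconds, of earlier runs of the same job.
history = [600, 620, 590, 610, 605]

print(is_abnormal_duration(history, 1500))  # run that took far longer
print(is_abnormal_duration(history, 200))   # run that finished far sooner
print(is_abnormal_duration(history, 608))   # run close to the baseline
```

Because the check uses the absolute difference, it flags jobs that ran much shorter than usual as well as jobs that ran much longer.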
To show only jobs that ran for less than or more than a specific amount of time, from the Duration list, either select a predefined duration or select Customize and enter the minimum or maximum time period.
To view more details about a job, from the Job column, select a job's name and then click the Health Checks tab.
The Baseline Health checks are displayed.
To display more information about the job's duration, from the Baseline column, select Duration.
For example, as shown in the following image, the job took much longer to finish than the baseline:
To display more information about the length of time the processing tasks took within a job, from the Baseline column, select Task Duration.
For example, as shown in the following image, a particular task took longer to complete than expected:
To display more information about the abnormal task, click the task, which opens the Task Details panel.
In the following example, the Task Details show that the abnormal task spent significantly more time on garbage collection than the average:
To display more information about the garbage collection for this abnormal task example, from the Baseline column, select Task GC Time.
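If you export task metrics for offline analysis, the same comparison against the average can be sketched in a few lines. The task IDs, the metric values, and the factor of 2 below are hypothetical, and the rule is an assumption for illustration rather than the check that Workload XM performs.

```python
from statistics import mean


def gc_outliers(tasks, factor=2.0):
    """Return the IDs of tasks whose GC time exceeds `factor` times the
    average GC time across all tasks, sorted by ID.

    `tasks` maps a task ID to a (gc_ms, duration_ms) tuple.
    Note: the outlier itself raises the average, so a modest factor is used.
    """
    avg_gc = mean(gc for gc, _ in tasks.values())
    return sorted(tid for tid, (gc, _) in tasks.items() if gc > factor * avg_gc)


def gc_share(gc_ms, duration_ms):
    """Fraction of the task's run time that was spent in garbage collection."""
    return gc_ms / duration_ms


# Hypothetical per-task metrics: (GC time in ms, total task duration in ms).
tasks = {
    "map_0": (400, 12000),
    "map_1": (450, 12500),
    "map_2": (9000, 20000),
    "map_3": (420, 11800),
}

for tid in gc_outliers(tasks):
    gc, duration = tasks[tid]
    print(f"{tid}: GC time {gc} ms, {gc_share(gc, duration):.0%} of task duration")
```

A task that spends a large share of its run time in garbage collection is a candidate for the memory configuration review described in the remaining steps.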
In the Task GC Time page, click the Execution Details tab and then click one of the MapReduce stages:
In the Summary panel, click View Configurations and then locate the configuration for the garbage collection by entering the MapReduce memory configuration property name in the Search field:

In this example, the configuration setting for the garbage collection is 1024. This value could leave the mapper JVM with insufficient memory and trigger excessive garbage collection. Increasing the value should reduce the time spent on garbage collection, which can improve performance and remove this task as a potential risk.
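When you increase mapper memory, the JVM heap usually needs to grow with the container. The following sketch assumes that the value above refers to the mapper container size in MB (the Hadoop property `mapreduce.map.memory.mb`), which this topic does not confirm, and that the heap is set to 80 percent of the container, a common convention. The helper name and the new value of 2048 are hypothetical.

```python
def mapper_memory_settings(container_mb, heap_ratio=0.8):
    """Return MapReduce mapper memory properties for a given container size.

    Assumptions: the heap is sized as a fixed fraction of the container so
    that non-heap JVM memory still fits; 0.8 is a convention, not a rule.
    """
    heap_mb = int(container_mb * heap_ratio)
    return {
        "mapreduce.map.memory.mb": str(container_mb),
        "mapreduce.map.java.opts": f"-Xmx{heap_mb}m",
    }


# Compare the current setting with a hypothetical doubled value.
for size_mb in (1024, 2048):
    print(mapper_memory_settings(size_mb))
```

After changing the values, run the job again and check the Task GC Time baseline to confirm that the time spent on garbage collection has dropped.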