Troubleshooting Abnormal Job Durations

Use Workload Manager to find and troubleshoot slow-running jobs and to identify areas of risk in the jobs running on your cluster.

  1. On the Data Engineering Summary page, click the time range in the upper right corner of the page and specify a time range you are interested in.



  2. In the Suboptimal Jobs graph, click the Abnormal Duration health check to view the number of jobs with an abnormal duration that executed within the selected time range. Any job whose duration falls outside its baseline is marked as slow. Hover over the graph to see how many jobs triggered each health check. (A small illustrative sketch of this check appears at the end of these steps.)



  3. After you click Abnormal Duration in the Suboptimal Jobs graph, the Data Engineering Jobs page displays a list of all slow jobs. Each of these jobs has triggered the Duration health check:



    From the Duration drop-down list, select a duration range, or select Customize to enter a custom minimum or maximum duration, to view only the jobs that meet those criteria.

  4. Click the Job name to view more detailed information, and then, on the Jobs detail page, click the Health Checks tab. Under the Duration health check, you can see that this job took much longer than its normal duration:



    To further investigate, click the Task Duration health check.

  5. After clicking Task Duration, you can see that this job contains a task that took an abnormally long time to finish:



    Click the task to view further details about it.

  6. After you click the task, the Task Details pane displays details about its run. In the following example, garbage collection is taking significantly more time than it does in the average task:



    Click Task GC Time to view more information about garbage collection for this job. (An example of enabling GC logging for map tasks appears at the end of these steps.)

  7. On the Task GC Time page, click the Execution Details tab, and then click one of the MapReduce stages:



  8. On the MapReduce stage Summary page, click View Configurations, and then enter part of the MapReduce memory configuration property name to search for and view the setting that affects garbage collection:



    In the case above, setting this property to 1024 might leave the mapper JVM with too little memory, causing garbage collection to run too frequently. Increasing this value, as shown in the example that follows these steps, might improve performance on your cluster.
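The exact property shown by the configuration search depends on your cluster, but for MapReduce the map container size is commonly controlled by `mapreduce.map.memory.mb` (which often defaults to 1024 MB) together with the map JVM heap set in `mapreduce.map.java.opts`. The following minimal sketch shows one way these could be raised when submitting a job programmatically; the property values, heap size, and job name are illustrative assumptions rather than values taken from Workload Manager, so adjust them to your workload and container limits.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class MapperMemoryTuningExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Assumed tuning: raise the map task container size above the common
        // 1024 MB default, and keep the JVM heap inside the container
        // (roughly 80% of the container size is a frequently used rule of thumb).
        conf.set("mapreduce.map.memory.mb", "2048");
        conf.set("mapreduce.map.java.opts", "-Xmx1638m");

        Job job = Job.getInstance(conf, "mapper-memory-tuning-example");
        // ... set the mapper, reducer, and input/output paths as usual,
        // then submit with job.waitForCompletion(true).
    }
}
```

The same properties can also be set for a single run on the command line with `-D`, or cluster-wide in `mapred-site.xml`; which location is appropriate depends on whether only this job needs the larger containers.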
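If you want to confirm the garbage-collection pressure that Workload Manager reports (step 6), one option is to turn on GC logging for the map task JVMs and inspect the task logs. The sketch below appends `-verbose:gc` to the map JVM options; the heap size is only a placeholder, and `mapreduce.map.java.opts` is assumed to be the relevant option for your MapReduce job.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class GcLoggingExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Placeholder heap size plus verbose GC logging, so each map task
        // writes basic GC activity to its task log for later inspection.
        conf.set("mapreduce.map.java.opts", "-Xmx819m -verbose:gc");

        Job job = Job.getInstance(conf, "gc-logging-example");
        // ... configure the rest of the job and submit it as usual.
    }
}
```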
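For a concrete picture of what the Abnormal Duration and Task Duration health checks flag (steps 2 and 5), the following minimal sketch compares a run's duration against a baseline range. It is purely illustrative: Workload Manager derives its baselines from your historical runs, and the thresholds and method here are assumptions, not the product's actual algorithm.

```java
public class DurationBaselineSketch {

    /** Flag a run whose duration falls outside an assumed baseline range. */
    static boolean isAbnormalDuration(long durationMs, long baselineMinMs, long baselineMaxMs) {
        return durationMs < baselineMinMs || durationMs > baselineMaxMs;
    }

    public static void main(String[] args) {
        // Hypothetical baseline of 5-12 minutes versus an observed 31-minute run.
        long observed = 31 * 60_000L;
        System.out.println(isAbnormalDuration(observed, 5 * 60_000L, 12 * 60_000L)); // prints: true
    }
}
```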