Troubleshooting an abnormal job duration

Identify areas of risk from jobs running on your workload cluster that complete within an unusual time period.

This topic describes how to locate and troubleshoot an abnormal job duration.

The steps include examples from a Spark engine that show how to further investigate and troubleshoot the cause of an abnormal job duration.

  1. Verify that you are logged in to the Cloudera Observability web UI.
    1. In a supported browser, log in to the Cloudera Data Platform.
      The CDP Public Cloud web interface landing page opens.
    2. From the Your Enterprise Data Cloud landing page, select the Observability tile.
      The Cloudera Observability landing page opens.
  2. In the Clusters page, do one of the following:
    • In the Search field, enter the name of the cluster whose workloads you want to analyze.
    • From the Cluster Name column, locate and click the name of the cluster whose workloads you want to analyze.
  3. From the time-range list in the Cluster Summary page, select a time period that meets your requirements.
  4. In the Usage Analysis chart, click the engine whose Failed column displays the number of jobs that did not complete.
  5. Depending on the engine you selected, scroll down on the engine's page that opens to either the Job Duration or the Query Duration chart widget, and then click the health check bar of interest.
    The Jobs or Queries page opens, listing all the jobs or queries that ran during the time period, their health status, the length of time each job or query took to run, the user who ran the job or query, and the job or query identification number.
  6. From the Duration list, narrow the results to jobs that ran for less or more time than the health check rule by either selecting a predefined duration or selecting Customize and entering the minimum or maximum time period.

  7. View more details about a job by selecting a job's name from the Job column and then clicking the Health Checks tab.
    The Baseline Health checks are displayed.
  8. Display more information about the job's duration by selecting Duration from the Baseline section.
    In the following example, the job took much longer than the baseline duration, which is the aggregate calculated over multiple jobs.

  9. Compare and analyze this job against other baseline metrics by clicking View all metrics.
  10. Continue to analyze and search for probable causes by doing one or more of the following:
    • To display more information about the length of time the processing tasks took within a job, select Task Duration, which opens a panel that describes the health check, displays information about the possible causes, and lists recommended solutions.
      In the following example, issues arose during Stage-9 of the job due to poor parallelization. The Recommendation section lists actions that may resolve the problem, along with the specific outlier tasks that produced the unusual results:

    • To display more details about an outlier, click the outlier task, which opens the Task Details panel.
      In the following example, the Task Details show that the outlier task took significantly longer to complete than in previous runs: 41 minutes compared to 8 minutes.

    • To gain more insight into the task's duration, for example by checking memory allocation, click the Execution Details tab and then, in the Summary panel, click Configurations:

    • In the Configurations panel, click the Spark Properties tab and search for the memory configuration settings and their values. If a memory setting is below the recommended value, increasing it can improve cluster performance.
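
The two checks in step 10 can also be reasoned about outside the UI. The following Python sketch is illustrative only and is not how Cloudera Observability evaluates its health checks. It assumes you have copied task durations and Spark property values by hand from the Task Details and Spark Properties panels. The function names, the 3x-median outlier threshold, the sample task names, and the recommended values are all hypothetical. Only the property names spark.executor.memory and spark.driver.memory are standard Spark settings, and the size parser handles just a simplified subset of the size formats that Spark accepts.

```python
import re
from statistics import median

# Simplified size suffixes (binary multiples). Spark accepts more forms
# than this sketch handles.
_UNITS = {"k": 1024, "m": 1024**2, "g": 1024**3, "t": 1024**4}


def parse_memory(value: str) -> int:
    """Convert a size string such as '512m' or '4g' to bytes."""
    match = re.fullmatch(r"(\d+)([kmgt])b?", value.strip().lower())
    if match is None:
        raise ValueError(f"unrecognized memory value: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit]


def find_outlier_tasks(durations_min: dict, factor: float = 3.0) -> list:
    """Return task IDs whose duration exceeds `factor` times the median.

    The median-based threshold is an assumption made for this sketch,
    not the rule the product applies.
    """
    typical = median(durations_min.values())
    return [task for task, d in durations_min.items() if d > factor * typical]


def undersized_settings(properties: dict, recommended: dict) -> dict:
    """Return settings whose configured value is below the recommended one."""
    low = {}
    for key, minimum in recommended.items():
        current = properties.get(key)
        if current is not None and parse_memory(current) < parse_memory(minimum):
            low[key] = (current, minimum)
    return low


if __name__ == "__main__":
    # Hypothetical durations in minutes; task-4 mirrors the 41-minute outlier.
    tasks = {"task-1": 8, "task-2": 7, "task-3": 9, "task-4": 41}
    print(find_outlier_tasks(tasks))

    # Hypothetical values copied from the Spark Properties tab, compared
    # against hypothetical recommended minimums.
    props = {"spark.executor.memory": "2g", "spark.driver.memory": "4g"}
    recommended = {"spark.executor.memory": "4g", "spark.driver.memory": "4g"}
    print(undersized_settings(props, recommended))
```

With the sample data, only task-4 is flagged as an outlier, and only spark.executor.memory is reported as below its recommended value. Replace the sample values with the figures shown for your own job before drawing any conclusions.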