Troubleshooting with the Job Comparison Feature

The Job Comparison feature makes it easy to compare two different runs of the same Data Engineering job. This is especially useful when something changes unexpectedly. For example, if a job that consistently completes within a specific amount of time suddenly starts taking longer, you want to know why. The Job Comparison feature lets you quickly see the differences between two runs of the same job so you can troubleshoot the cause.

When a job is flagged as slow, the job page displays a Compare with Previous Run link that opens the Job Comparison tool and compares the current run of the job with the previous run. You can also compare any two runs of a job by following the steps below:

  1. Log in to the Workload XM console at wxm.cloudera.com, and in Search, type the name of the cluster that is running the jobs you want to analyze.
  2. In the left menu under Data Engineering, select Jobs, and click the down-arrow next to Range on the right side of the filter menu to select the date range for the jobs you want to compare:



  3. View the list of jobs that have executed during the selected time period:



    The spark-etl job runs often, but the last three runs of this job have been taking significantly longer. Notice that on August 2, the duration was 27 minutes, but the next day on August 3, the duration almost doubled to 51 minutes. The Job Comparison feature enables you to examine both runs to determine why the duration changed.

  4. Click one of the runs of the spark-etl job, and then in the Jobs detail page, click the Trends tab:



  5. On the Trends page, up to 30 runs prior to the selected job run are displayed in a list. The Input and Output columns immediately show that the runs processed different amounts of data. For example, on August 2 the job read 2.4 GB of data and wrote 1.8 GB. The next day, on August 3, it read 4.2 GB, almost twice as much data, and wrote even more at 4.6 GB. What is causing this change?



    Select the check boxes adjacent to the job runs you want to compare. In the above example, the runs for August 2 and August 3 are selected. Then click Compare.

  6. On the Job Comparison page you can view more details for both of these runs. The tabs that are most useful for this comparison are the Structure, Configurations, and SQL Executions tabs:



    The SQL Executions tab is only available for Spark jobs.

  7. The Structure tab shows the sub-jobs executed by both runs of the spark-etl job:



    The run that took 27 minutes executed only 9 sub-jobs, but the run that took 51 minutes, almost twice the time, executed 16 sub-jobs, almost twice as many. You can click any of the sub-jobs listed to view more details.

    Examining the Configurations tab shows that the configurations for the two runs are identical, so a configuration change most likely did not cause this anomaly.

  8. The SQL Executions tab shows that twice as many Spark queries executed during the longer run:



    This analysis tells us that either the Spark SQL code was changed by the Job Developer, or the data that the code ran against triggered more of the Spark queries in the job (see the sketch after these steps for how the data alone can drive the number of queries).

    The Workload XM Job Comparison feature has helped you quickly narrow down the possible causes of this anomaly. In this case, the change in job duration appears to be expected, so no further troubleshooting is required.
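
For reference, the following PySpark sketch illustrates the second possibility. It is hypothetical, not the actual spark-etl job, and the paths, table, and column names are assumptions. It shows how unchanged code that issues one query per distinct date in its input triggers more Spark SQL executions, and therefore more sub-jobs and a longer duration, on days when more data arrives:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("spark-etl-sketch").getOrCreate()

    # Assumed input path and column names, for illustration only.
    events = spark.read.parquet("/data/incoming/events")
    events.createOrReplaceTempView("events")

    # One Spark SQL execution runs per distinct event_date found in the input,
    # so a run that ingests a larger backlog of dates issues more queries and
    # spawns more sub-jobs, even though the code is unchanged.
    dates = [row.event_date for row in
             spark.sql("SELECT DISTINCT event_date FROM events").collect()]

    for d in dates:
        daily = spark.sql(
            "SELECT customer_id, SUM(amount) AS total "
            "FROM events "
            f"WHERE event_date = '{d}' "
            "GROUP BY customer_id"
        )
        daily.write.mode("overwrite").parquet(f"/data/curated/daily/{d}")

    spark.stop()

Under a pattern like this, a run that reads roughly twice as much data (for example, a two-day backlog) issues roughly twice as many queries, which is consistent with the doubling of sub-jobs and duration observed in the comparison above.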