Troubleshooting with the Job Comparison Feature

Comparing two different runs of the same job is especially useful when you notice unexpected changes. For example, when a job that consistently completes within a specific amount of time starts taking longer, comparing two of its runs lets you analyze the differences and troubleshoot the cause.

This topic describes how to compare any two runs of a job using the Job Comparison tool. The steps include examples that show how to investigate and troubleshoot further.

  1. In a supported browser, log in to the Workload XM web UI by doing the following:
    1. In the web browser URL field, enter the Workload XM URL that you were given by your system administrator and press Enter.
    2. When the Workload XM Log in page opens, enter your Workload XM user name and password.
    3. Click Log in.
  2. In the Search field of the Clusters page, enter the name of the cluster whose workloads you want to analyze.
  3. From the time-range list in the Cluster Summary page, select a time period that meets your requirements.
  4. In the Trend widget, select the tab of an engine whose jobs you want to analyze and then click its Total Jobs value.
    The engine's Jobs page opens.
  5. Examine the list of jobs that have executed during the selected time period:
    For example, as shown in the following image, the last three runs of the spark-etl job took significantly longer to complete than usual. On August 2, the duration was 27 minutes, but on August 3, it almost doubled to 51 minutes:

  6. To list all the runs of a specific job and display their details, select one of the job runs and then, in the Jobs details page, click the Trends tab.
    In the following example, notice how the amount of data changes in the Input and Output columns. On August 2, the job processed 2.4 GB of data, but on August 3, it processed 4.2 GB, which is almost twice as much. The Job Comparison tool enables you to examine both runs to determine why the amount of data changed:

  7. To compare two job runs, select the check boxes adjacent to the runs you require and then click Compare. In this example, the runs for August 2 and August 3 are selected.
    The Job Comparison page opens, displaying more details about each job run.
    For this example's comparison, the tabs that contain more information about the job runs are the Structure, Configurations, and SQL Executions tabs:

  8. Display the sub-jobs executed for both of your selected job runs by selecting the Structure tab.
    For example, as shown in the following image, the run that took 27 minutes executed only 9 sub-jobs, whereas the run that took 51 minutes, almost twice as long, executed 16 sub-jobs, almost twice as many. Clicking any of the listed sub-jobs displays more details.

  9. Display the job configurations by selecting the Configurations tab.
    In our example, the configurations of the two runs were identical, so a configuration change probably did not cause the anomaly.
  10. Display the number of SQL executions by selecting the SQL Executions tab.
    In our example, as shown in the following image, twice as many Spark queries executed for the run that took longer.

The analysis from the Job Comparison tool tells us that either the Job Developer changed the Spark SQL code or the data on which the code ran triggered more of the Spark queries in the job. The Workload XM Job Comparison tool helped narrow down the possible causes of the anomaly. In our example, the longer duration is consistent with the larger amount of data processed, so the change appears to be expected and no further troubleshooting is required.
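
The reasoning in this example can also be checked with a few lines of arithmetic. The following Python sketch is only an illustration and is not part of Workload XM: the RunSummary structure and the helper functions are hypothetical, the metric values are copied by hand from the example above, and the configuration entry is a placeholder because the example does not list the actual settings.

```python
from dataclasses import dataclass, field


@dataclass
class RunSummary:
    """Values read manually from the Job Comparison page (hypothetical structure)."""
    label: str
    duration_min: float
    data_gb: float
    sub_jobs: int
    config: dict = field(default_factory=dict)


def growth_ratios(baseline: RunSummary, slow: RunSummary) -> dict:
    """Return how much each metric grew from the baseline run to the slow run."""
    return {
        "duration": slow.duration_min / baseline.duration_min,
        "data": slow.data_gb / baseline.data_gb,
        "sub_jobs": slow.sub_jobs / baseline.sub_jobs,
    }


def changed_config_keys(baseline: RunSummary, slow: RunSummary) -> list:
    """Return the configuration keys whose values differ between the two runs."""
    keys = baseline.config.keys() | slow.config.keys()
    return sorted(k for k in keys if baseline.config.get(k) != slow.config.get(k))


# Metrics from the example above. The configuration value is a placeholder
# (assumption); the example only states that both runs had identical settings.
aug2 = RunSummary("August 2", 27, 2.4, 9, {"spark.executor.memory": "4g"})
aug3 = RunSummary("August 3", 51, 4.2, 16, {"spark.executor.memory": "4g"})

ratios = growth_ratios(aug2, aug3)
for name, value in ratios.items():
    print(f"{name}: x{value:.2f}")
print("changed configuration keys:", changed_config_keys(aug2, aug3))
```

With the values from this example, the duration, the amount of data, and the number of sub-jobs all grow by a similar factor (between roughly 1.75 and 1.9), and no configuration key differs. That pattern supports the conclusion that the longer run was driven by additional work rather than by a configuration change.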