Identifying and addressing performance problems of your ML workloads

Identify inefficient phases of your workloads for optimization and performance tuning.

Describes how to compare any two runs of an ML workload using the Comparison tool.

  1. Verify that you are logged in to the Cloudera Observability web UI and that you selected an environment from the Analytics Environments page.
    1. In a supported browser, log in to the Cloudera web interface.
      The Cloudera web interface landing page opens.
    2. From the Your Enterprise Data Cloud landing page, select the Observability tile.
      The Cloudera Observability landing page opens to the main navigation panel.
    3. From the Cloudera Observability Environments page, select the environment required for analysis.

      The Environment navigation panel opens.

  2. From the Environment Name column on the Environments page, locate and click the environment name whose workload diagnostic information requires analysis and troubleshooting.
    The Environment navigation panel opens, which hierarchically lists the environment and the services hosted on it.
  3. Verify that the Cluster Summary page is displayed for the cluster that requires analysis.
  4. If not already expanded, from the Environment navigation panel, expand the Machine Learning environment, and then select the ML workspace.
  5. Optional: From the time-range list, select a time period that meets your requirements.
  6. Click a workload link in the Jobs, Sessions, Models, or Applications chart widget.
    The ML Workload list page opens.
  7. From the Name column, select the ML workload.
    The ML workload details page opens for the selected ML workload category (Jobs, Sessions, Models, or Applications).
  8. To measure the current performance of a workload against the average performance of previous runs, select the Baseline tab.
    The Baseline tab captures and presents metrics for each execution, where one execution represents a single job run per day. These metrics include CPU utilization (allocated), memory usage (allocated), and execution duration. If the same job is executed on subsequent days and significant deviations in these metrics are observed, you can analyze the discrepancies and determine their potential causes.

    For information on Baseline metrics, see Baseline health checks.
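Conceptually, the Baseline tab's deviation check amounts to comparing each run's metrics against the baseline average. The following sketch illustrates that idea only; the function, metric names, and threshold are hypothetical and do not correspond to any Cloudera Observability API.

```python
# Illustrative sketch: flag metrics that deviate significantly from the
# baseline average, similar in spirit to what the Baseline tab surfaces.
# All names, values, and the 25% threshold are hypothetical.

def flag_deviations(baseline_avg, current_run, threshold=0.25):
    """Return metrics whose relative deviation from the baseline
    average exceeds the threshold (25% by default)."""
    flagged = {}
    for metric, avg in baseline_avg.items():
        current = current_run.get(metric)
        if current is None or avg == 0:
            continue
        deviation = (current - avg) / avg
        if abs(deviation) > threshold:
            flagged[metric] = round(deviation, 2)
    return flagged

# Example: today's run takes 50% longer than the baseline average.
baseline = {"cpu_allocated": 4.0, "memory_gb": 16.0, "duration_s": 600.0}
today = {"cpu_allocated": 4.0, "memory_gb": 17.0, "duration_s": 900.0}

print(flag_deviations(baseline, today))  # → {'duration_s': 0.5}
```

A run that only drifts slightly (here, memory at about 6% above baseline) is not flagged, which is why a threshold matters: it separates routine variation from discrepancies worth investigating.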

  9. To troubleshoot performance-related issues between two different runs of the same workload, do the following:
    1. From the workload details page, select the Trends tab.
    2. Scroll down and, from the table, select the check boxes adjacent to the workload job runs that you require, such as the latest run and a run from a week ago, and then click Compare.
      The Execution Comparison page opens, displaying more details about the selected workload.
    3. From the Details section, select the Basics tab and review the details of both selected workload runs.
    4. Select the Metrics tab and compare statistical differences between the selected workload runs. For example, you can identify differences in the workload run durations.
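The kind of side-by-side comparison the Metrics tab presents can be sketched as computing per-metric deltas between two runs. This is an illustrative example only; the function, metric names, and values below are hypothetical, not part of Cloudera Observability.

```python
# Illustrative sketch: compute absolute and percentage differences
# between two runs of the same workload (run_b relative to run_a).
# Metric names and values are hypothetical.

def compare_runs(run_a, run_b):
    """Return per-metric differences for metrics present in both runs."""
    diffs = {}
    for metric in sorted(set(run_a) & set(run_b)):
        a, b = run_a[metric], run_b[metric]
        pct = ((b - a) / a * 100) if a else float("inf")
        diffs[metric] = {"a": a, "b": b, "delta": b - a, "pct": round(pct, 1)}
    return diffs

# Example: the latest run is 50% slower than last week's run.
last_week = {"duration_s": 620, "cpu_allocated": 4}
latest = {"duration_s": 930, "cpu_allocated": 4}

for metric, d in compare_runs(last_week, latest).items():
    print(f"{metric}: {d['a']} -> {d['b']} ({d['pct']:+.1f}%)")
```

Comparing a recent run against an older one in this way makes it easy to spot which metric regressed, such as a jump in run duration while allocated CPU stayed constant.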