Machine Learning (ML) workload metrics and status details

Learn how to access ML workload and workspace information, filter and search for details to focus on anomalies, and monitor the status of jobs, models, applications, and sessions in Cloudera Observability.

How to access the ML workloads page

You can access the ML Workloads page by clicking the links on the workspace and workload category pages.
The following chart links open the ML Workloads page:
  • Total ML Workloads
  • Failed ML Workloads
  • ML Workloads Execution Trends
  • Usage Analysis
  • Execution Trends
  • Duration

Filter options for ML workloads

You can use the following filters to narrow the list of workloads and focus on specific anomalies:
  • Search: Search for a specific workload.
  • Status: Select one or more workload statuses. For information on these statuses, see ML workload statuses.
  • Run As: Select a specific user name. By default, all user names are displayed.
  • Project: Select a specific project name. By default, all project names are displayed.
  • Type: Select one or more workload types.

    If you selected a workload type on the previous workspace or workload category page, the data for that workload type is displayed.

  • Duration: Select the length of time the workload has run. By default, all durations are displayed.
  • Range: Select the time range. By default, data for the last 24 hours is displayed.
Besides these filters, the ML Workloads table includes the following additional columns:
  • Team: Displays the team name if a workload is run as part of the team project.
  • Kernel: Displays the pre-installed Python version of the Jupyter kernel.
  • CPU Cores: Displays total allocated CPU cores.
  • Memory: Displays total allocated memory.
  • GPU Cores: Displays total allocated GPU cores.
  • Start Time: Displays the workload start time in Indian Standard Time (IST).
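
The way these filters combine can be sketched in code. The following Python example is only an illustration, not the Cloudera Observability implementation or API: the Workload record, the filter_workloads function, and the sample data are all hypothetical, and the sketch assumes that the Range filter is applied to the workload start time. A filter that is not set places no restriction, which mirrors the defaults described above. The Duration filter is left out for brevity.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List, Optional, Set


@dataclass
class Workload:
    # Hypothetical record; the fields mirror the filter names, not a real API.
    name: str
    status: str         # for example "Succeeded", "Failed", "Timed Out"
    run_as: str
    project: str
    workload_type: str  # for example "Job", "Model", "Application", "Session"
    start_time: datetime


def filter_workloads(
    workloads: Iterable[Workload],
    *,
    now: datetime,
    search: Optional[str] = None,
    statuses: Optional[Set[str]] = None,
    run_as: Optional[str] = None,
    project: Optional[str] = None,
    types: Optional[Set[str]] = None,
    time_range: timedelta = timedelta(hours=24),  # default Range: last 24 hours
) -> List[Workload]:
    """Return the workloads that satisfy every filter that is set."""
    earliest = now - time_range
    matches = []
    for w in workloads:
        # Assumption: Range is evaluated against the start time.
        if w.start_time < earliest or w.start_time > now:
            continue
        if search and search.lower() not in w.name.lower():
            continue
        if statuses and w.status not in statuses:
            continue
        if run_as and w.run_as != run_as:
            continue
        if project and w.project != project:
            continue
        if types and w.workload_type not in types:
            continue
        matches.append(w)
    return matches


now = datetime(2024, 1, 2, 12, 0)
sample = [
    Workload("churn-train", "Failed", "alice", "churn", "Job", now - timedelta(hours=2)),
    Workload("churn-api", "Succeeded", "alice", "churn", "Model", now - timedelta(hours=5)),
    Workload("eda-session", "Timed Out", "bob", "sales", "Session", now - timedelta(hours=30)),
]

# Focus on anomalies: failed or timed-out workloads in the default 24-hour range.
anomalies = filter_workloads(sample, now=now, statuses={"Failed", "Timed Out"})
print([w.name for w in anomalies])
```

With this sample data, only churn-train is returned: eda-session also has an anomalous status, but it started 30 hours ago and falls outside the default 24-hour range.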

Child ML workloads

For Job and Session parent workloads, you can monitor ML Worker and Spark Executor child workloads. Each child workload runs in a separate pod as part of its parent workload.
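
The parent and child relationship can be pictured as a simple grouping. The following Python sketch is an illustration under stated assumptions, not how Cloudera Observability stores or reports this data: the WorkloadPod record, its parent_id field, and the group_children function are hypothetical names introduced for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional

# Taken from the text: only Job and Session parents have child workloads,
# and the child workload types are ML Worker and Spark Executor.
PARENT_TYPES_WITH_CHILDREN = {"Job", "Session"}
CHILD_TYPES = {"ML Worker", "Spark Executor"}


@dataclass
class WorkloadPod:
    # Hypothetical record: one pod per workload, with an optional parent link.
    pod_id: str
    workload_type: str
    parent_id: Optional[str] = None


def group_children(pods: Iterable[WorkloadPod]) -> Dict[str, List[str]]:
    """Map each Job or Session pod ID to the IDs of its child workload pods."""
    pods = list(pods)
    by_id = {p.pod_id: p for p in pods}
    children: Dict[str, List[str]] = defaultdict(list)
    for p in pods:
        if p.workload_type not in CHILD_TYPES:
            continue
        parent = by_id.get(p.parent_id) if p.parent_id else None
        if parent and parent.workload_type in PARENT_TYPES_WITH_CHILDREN:
            children[parent.pod_id].append(p.pod_id)
    return dict(children)


pods = [
    WorkloadPod("job-1", "Job"),
    WorkloadPod("job-1-w0", "ML Worker", parent_id="job-1"),
    WorkloadPod("job-1-e0", "Spark Executor", parent_id="job-1"),
    WorkloadPod("sess-7", "Session"),
    WorkloadPod("sess-7-w0", "ML Worker", parent_id="sess-7"),
    WorkloadPod("model-3", "Model"),
]
children = group_children(pods)
print(children)
```

In this sample, job-1 has two child pods, sess-7 has one, and model-3 has none because Model workloads do not have child workloads.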

ML workload statuses

The statuses available for each workload category are as follows:

Job status          Description
Stopped             The user has stopped the job. The pod has been deleted.
Succeeded           The job completed successfully with a zero (0) exit code.
Failed              Failed to start the job. The reason can be resource constraints or errors in job metadata.
Timed Out           The job has timed out and will no longer run.
Build Failed        Not applicable.

Model status        Description
Stopped             The user has stopped the model.
Succeeded           The model has been deployed successfully.
Failed              Failed to deploy the model.
Timed Out           The model has timed out and will no longer be deployed.
Build Failed        The model fails during the build stage.

Application status  Description
Stopped             The user has stopped the application.
Succeeded           The application pod has been successfully deployed.
Failed              Failed to deploy the application.
Timed Out           The application has timed out and will no longer be deployed.
Build Failed        Not applicable.

Session status      Description
Stopped             The user has stopped the session.
Succeeded           The session pod has been successfully executed.
Failed              Failed to execute the session.
Timed Out           The session timed out either due to inactivity or due to an absolute timeout.
Build Failed        Not applicable.
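
The status descriptions above can also be captured as a lookup table, for example when annotating exported workload data. The following Python sketch is a hypothetical helper, not part of Cloudera Observability: the STATUS_DESCRIPTIONS mapping and both functions are names invented for this example, and the descriptions are copied from the lists above. A value of None marks a status that is not applicable to that category.

```python
from typing import Dict, List, Optional

# Category -> status -> description; None means "Not applicable".
STATUS_DESCRIPTIONS: Dict[str, Dict[str, Optional[str]]] = {
    "Job": {
        "Stopped": "The user has stopped the job. The pod has been deleted.",
        "Succeeded": "The job completed successfully with a zero (0) exit code.",
        "Failed": "Failed to start the job. The reason can be resource "
                  "constraints or errors in job metadata.",
        "Timed Out": "The job has timed out and will no longer run.",
        "Build Failed": None,
    },
    "Model": {
        "Stopped": "The user has stopped the model.",
        "Succeeded": "The model has been deployed successfully.",
        "Failed": "Failed to deploy the model.",
        "Timed Out": "The model has timed out and will no longer be deployed.",
        "Build Failed": "The model fails during the build stage.",
    },
    "Application": {
        "Stopped": "The user has stopped the application.",
        "Succeeded": "The application pod has been successfully deployed.",
        "Failed": "Failed to deploy the application.",
        "Timed Out": "The application has timed out and will no longer be deployed.",
        "Build Failed": None,
    },
    "Session": {
        "Stopped": "The user has stopped the session.",
        "Succeeded": "The session pod has been successfully executed.",
        "Failed": "Failed to execute the session.",
        "Timed Out": "The session timed out either due to inactivity or due "
                     "to an absolute timeout.",
        "Build Failed": None,
    },
}


def describe_status(category: str, status: str) -> str:
    """Return the description of a status, or "Not applicable"."""
    description = STATUS_DESCRIPTIONS[category][status]
    return description if description is not None else "Not applicable"


def applicable_statuses(category: str) -> List[str]:
    """Return the statuses that can occur for a workload category."""
    return [s for s, d in STATUS_DESCRIPTIONS[category].items() if d is not None]


print(applicable_statuses("Job"))
print(describe_status("Model", "Build Failed"))
```

The lookup makes one detail of the lists easy to see: Build Failed applies only to models, so it is filtered out of the applicable statuses for jobs, applications, and sessions.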