ML workload metrics and status details

Learn how to access the ML workload and workbench information, filter and search for details to focus on anomalies, and monitor the status of jobs, models, applications, and sessions in Cloudera Observability.

How to access the ML workloads page

You can access the ML Workloads page by clicking the links on the workbench and workload category pages.


Page	Links available on charts
Workspace	Total ML Workloads, Failed ML Workloads, ML Workloads Execution Trends
Workload	Usage Analysis, Execution Trends, Duration

Filter options for ML workloads

You can use the following filters to narrow the list of workloads and focus on specific anomalies:

Search: Search for a specific workload.
Status: Select one or more workload statuses. For information on these statuses, see ML workload statuses.
Run As: Select a specific user name. By default, all user names are displayed.
Project: Select a specific project name. By default, all project names are displayed.
Type: Select one or more workload types.
If you selected a workload type on the previous workbench or workload category page, the data for that workload type is displayed.
Sub Type: Select a child workload type (ML Worker, Spark Executor, or All) to display child workloads.
Duration: Select how long the workload has been running. By default, all durations are displayed.
Range: Select the time range. By default, data for the last 24 hours is displayed.
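
The way these filters combine can be illustrated with a minimal Python sketch. It is not part of Cloudera Observability and does not call any product API: the `Workload` record, its field names, and the `filter_workloads` helper are hypothetical and only mirror the filter labels above. The sketch assumes that filters are combined with AND, that an unset filter means "all", and that Search is a case-insensitive substring match on the workload name. The Sub Type and Duration filters are omitted for brevity.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List, Optional, Set


@dataclass
class Workload:
    # Hypothetical record; the fields mirror the filter labels, not a real API.
    name: str
    status: str
    run_as: str
    project: str
    workload_type: str
    start_time: datetime


def filter_workloads(
    workloads: Iterable[Workload],
    now: datetime,
    search: Optional[str] = None,
    statuses: Optional[Set[str]] = None,
    run_as: Optional[str] = None,
    project: Optional[str] = None,
    workload_types: Optional[Set[str]] = None,
    range_hours: int = 24,
) -> List[Workload]:
    """Return workloads that match every filter that is set (None means 'all')."""
    cutoff = now - timedelta(hours=range_hours)
    matches = []
    for w in workloads:
        if w.start_time < cutoff:
            continue  # outside the selected time range
        if search and search.lower() not in w.name.lower():
            continue  # assumed: case-insensitive substring search on the name
        if statuses and w.status not in statuses:
            continue
        if run_as and w.run_as != run_as:
            continue
        if project and w.project != project:
            continue
        if workload_types and w.workload_type not in workload_types:
            continue
        matches.append(w)
    return matches


# Made-up sample data, for illustration only.
now = datetime(2024, 1, 2, 12, 0)
sample = [
    Workload("nightly-train", "Failed", "alice", "churn", "Job", now - timedelta(hours=2)),
    Workload("explore", "Succeeded", "bob", "churn", "Session", now - timedelta(hours=5)),
    Workload("old-train", "Failed", "alice", "churn", "Job", now - timedelta(hours=30)),
]

# Failed jobs within the default 24-hour range.
failed_jobs = filter_workloads(sample, now, statuses={"Failed"}, workload_types={"Job"})
print([w.name for w in failed_jobs])
```

In this sample, the workload that started 30 hours ago is excluded by the default 24-hour range even though its status and type match, which is the same effect the Range filter has on the table.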

Besides these filters, the ML Workloads table includes the following additional columns:

Name: Displays the name of a job, allowing easy identification.
Creator: Displays the name of the individual who created the workload, giving insight into job ownership.
Team: Displays the team name if a workload is run as part of the team project.
Kernel: Displays the Python version of the pre-installed Jupyter kernel.
CPU Cores: Displays total allocated CPU cores.
Memory: Displays total allocated memory.
GPU Cores: Displays total allocated GPU cores.
Start Time: Displays the workload start time in Indian Standard Time (IST).

Child ML workloads

For Job and Session parent workloads, you can monitor ML Worker and Spark Executor child workloads. Each child workload runs in a separate pod as part of its parent workload.

ML workload statuses

The following tables list the statuses available for each workload category.


Job status	Description
Stopped	The user has stopped the job. The pod has been deleted.
Succeeded	The job completed successfully with a zero (0) exit code.
Failed	The job failed to start. Possible reasons include resource constraints or errors in the job metadata.
Timed Out	The job has timed out and will no longer run.
Build Failed	Not applicable.


Model status	Description
Stopped	The user has stopped the model.
Succeeded	The model has been deployed successfully.
Failed	The model failed to deploy.
Timed Out	The model has timed out and will no longer be deployed.
Build Failed	The model failed during the build stage.


Application status	Description
Stopped	The user has stopped the application.
Succeeded	The application pod has been successfully deployed.
Failed	The application failed to deploy.
Timed Out	The application has timed out and will no longer be deployed.
Build Failed	Not applicable.


Session status	Description
Stopped	The user has stopped the session.
Succeeded	The session pod has been successfully executed.
Failed	The session failed to execute.
Timed Out	The session timed out either due to inactivity or due to an absolute timeout.
Build Failed	Not applicable.
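
The four tables share the same status names and differ only in whether Build Failed applies. A minimal Python sketch of that relationship follows. The `APPLICABLE_STATUSES` mapping and the `is_applicable` helper are hypothetical names used for illustration; they summarize the tables above and are not a Cloudera Observability API.

```python
# Statuses that appear with a description for every workload category.
COMMON_STATUSES = ("Stopped", "Succeeded", "Failed", "Timed Out")

# Per the tables above, Build Failed is "Not applicable" for every
# category except Model.
APPLICABLE_STATUSES = {
    "Job": COMMON_STATUSES,
    "Model": COMMON_STATUSES + ("Build Failed",),
    "Application": COMMON_STATUSES,
    "Session": COMMON_STATUSES,
}


def is_applicable(category: str, status: str) -> bool:
    """Return True if the status is meaningful for the workload category."""
    return status in APPLICABLE_STATUSES[category]


print(is_applicable("Model", "Build Failed"))
print(is_applicable("Job", "Build Failed"))
```

A lookup like this can help when interpreting filtered results: selecting Build Failed in the Status filter is only expected to match Model workloads.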