Performance metrics of ML workloads by category
You can monitor the performance of workloads in the Jobs, Sessions, Models, and Applications categories and understand why workloads fail, particularly when failures are caused by resource exhaustion. To identify resource constraints, analyze CPU, RAM, GPU, and disk usage.
To effectively manage and optimize your Cloudera AI workflows, you must
monitor the performance of individual workloads. Tracking key metrics such as duration, resource
utilization (CPU allocated, GPU allocated, and memory allocated), usage analysis, and execution
trends helps you identify bottlenecks and improve overall efficiency, as shown in the sketch after the following list.
- Jobs: Monitor long-running jobs.
- Sessions: Monitor long-running sessions.
- Models: Monitor all active models currently deployed on your workbench. Use the insights gained from monitoring to prioritize deployment optimizations.
- Applications: Monitor long-running web applications.
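As a rough illustration of how these metrics can surface problems, the following Python sketch flags workload runs that are long-running or that approach their memory allocation. The records, field names, and thresholds are illustrative assumptions, not part of any Cloudera AI API; in practice, the corresponding values appear on the workload dashboards described in this section.

```python
from datetime import datetime, timedelta

# Hypothetical workload records; real values come from the workload
# dashboards (duration, CPU/GPU/memory allocated, and usage).
job_runs = [
    {"name": "nightly-etl", "started": datetime(2024, 5, 1, 1, 0),
     "finished": datetime(2024, 5, 1, 4, 30), "cpu_allocated": 4,
     "memory_gib_allocated": 16, "memory_gib_peak": 15.2},
    {"name": "feature-build", "started": datetime(2024, 5, 1, 2, 0),
     "finished": datetime(2024, 5, 1, 2, 20), "cpu_allocated": 2,
     "memory_gib_allocated": 8, "memory_gib_peak": 2.1},
]

LONG_RUNNING = timedelta(hours=2)   # assumed threshold for "long-running"
MEMORY_PRESSURE = 0.9               # assumed ratio that signals exhaustion risk

for run in job_runs:
    duration = run["finished"] - run["started"]
    memory_ratio = run["memory_gib_peak"] / run["memory_gib_allocated"]
    flags = []
    if duration >= LONG_RUNNING:
        flags.append("long-running")
    if memory_ratio >= MEMORY_PRESSURE:
        flags.append("near memory limit")
    print(f"{run['name']}: duration={duration}, "
          f"memory usage={memory_ratio:.0%} of allocation, "
          f"flags={flags or ['ok']}")
```

Runs flagged as near their memory limit are candidates for larger memory allocations, while long-running runs may warrant a closer look at their execution trends.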
Charts on workload category dashboard
Lists the chart name and metrics displayed on each chart.
| Chart | Metrics |
| --- | --- |