Performance metrics of Machine Learning (ML) workloads by category
You can monitor workload performance by category (jobs, sessions, models, and applications) and understand why workloads fail, particularly when the cause is resource exhaustion. To identify resource constraints, analyze CPU, RAM, GPU, and disk usage.
To effectively manage and optimize your machine learning workflows, you must monitor the performance of individual jobs. Tracking key metrics such as duration, resource utilization (allocated CPU, GPU, and memory), usage analysis, and execution trends helps identify bottlenecks and improve overall efficiency.
- Jobs: Monitor long-running jobs.
- Sessions: Monitor long-running sessions.
- Models: Monitor all active models currently deployed on your workspace. Use the insights gained from monitoring to prioritize which deployments to optimize.
- Applications: Monitor long-running web applications.
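As a rough illustration of how usage readings can point to resource exhaustion, the following Python sketch flags workloads whose usage is close to their allocation. The `ResourceSample` record, the `find_constrained` helper, the sample values, and the 90% threshold are all hypothetical and chosen for illustration only; they are not part of the product or its API.

```python
from dataclasses import dataclass


@dataclass
class ResourceSample:
    """One hypothetical usage reading for a workload (not a product schema)."""
    workload: str
    category: str   # "job", "session", "model", or "application"
    resource: str   # "cpu", "ram", "gpu", or "disk"
    used: float
    allocated: float


def find_constrained(samples, threshold=0.9):
    """Return (workload, resource, utilization) for readings at or above threshold."""
    flagged = []
    for s in samples:
        if s.allocated <= 0:
            continue  # skip resources that were never allocated
        utilization = s.used / s.allocated
        if utilization >= threshold:
            flagged.append((s.workload, s.resource, round(utilization, 2)))
    return flagged


# Made-up readings covering several workload categories.
samples = [
    ResourceSample("train-model", "job", "ram", 15.2, 16.0),
    ResourceSample("train-model", "job", "cpu", 1.1, 4.0),
    ResourceSample("notebook-1", "session", "gpu", 0.0, 0.0),
    ResourceSample("scoring-api", "model", "disk", 9.0, 10.0),
]

constrained = find_constrained(samples)
print(constrained)
```

A workload that appears in this list shortly before it fails is a reasonable candidate for a larger resource allocation; the right threshold depends on the workload and is an assumption here.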
Charts on workload category dashboard
The following table lists each chart and the metrics it displays.
Chart | Metrics |
---|---|