Performance metrics of AI workloads by category
You can monitor workload performance across the job, session, model, and application categories and understand why workloads fail, particularly when a failure is caused by resource exhaustion. To identify resource constraints, analyze CPU, memory, and GPU usage.
Tracking key metrics such as duration, resource utilization (CPU allocated, GPU allocated, and
memory allocated), usage analysis, and execution trends helps you identify bottlenecks and
improve overall efficiency.
- Jobs: Monitor long-running jobs.
- Sessions: Monitor long-running sessions.
- Models: Monitor all active models currently deployed on your workbench. Use the insights gained from monitoring to prioritize which deployments to optimize.
- Applications: Monitor long-running web applications.
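The kind of analysis described above, comparing what a workload uses against what it was allocated, can be sketched outside the dashboard as follows. This is only an illustration: the `WorkloadSample` record, its field names, the sample values, and the 90% threshold are all assumptions made for the example and do not come from the product or its API.

```python
from dataclasses import dataclass


@dataclass
class WorkloadSample:
    """Hypothetical record of one workload's metrics (not a product schema)."""
    name: str
    category: str              # "job", "session", "model", or "application"
    duration_s: float
    cpu_allocated: float       # vCPUs
    cpu_used: float
    memory_allocated_gb: float
    memory_used_gb: float
    gpu_allocated: float
    gpu_used: float


def utilization(used: float, allocated: float) -> float:
    """Return used/allocated, or 0.0 when nothing was allocated."""
    return used / allocated if allocated > 0 else 0.0


def constrained_resources(sample: WorkloadSample, threshold: float = 0.9) -> list[str]:
    """List the resources whose utilization meets or exceeds the threshold.

    The 0.9 default is an arbitrary illustrative cutoff.
    """
    ratios = {
        "cpu": utilization(sample.cpu_used, sample.cpu_allocated),
        "memory": utilization(sample.memory_used_gb, sample.memory_allocated_gb),
        "gpu": utilization(sample.gpu_used, sample.gpu_allocated),
    }
    return [resource for resource, ratio in ratios.items() if ratio >= threshold]


# Made-up sample data for illustration only.
samples = [
    WorkloadSample("train-model", "job", 5400.0, 4.0, 2.0, 16.0, 15.5, 1.0, 0.5),
    WorkloadSample("notebook", "session", 7200.0, 2.0, 1.9, 8.0, 4.0, 0.0, 0.0),
]

for s in samples:
    print(s.category, s.name, constrained_resources(s))
```

A workload that sits close to its allocation for one resource is a candidate for a larger allocation of that resource, while consistently low ratios suggest the allocation could be reduced.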
Charts on workload category dashboard
Lists the chart name and metrics displayed on each chart.
| Chart | Metrics |
|---|---|
