Performance metrics of Machine Learning (ML) workloads by category

You can monitor the performance of workloads using the job, session, model, and application categories and understand why workloads fail, particularly when a failure is caused by resource exhaustion. To identify resource constraints, you can analyze CPU, RAM, GPU, and disk usage.

To manage and optimize your machine learning workflows effectively, monitor the performance of individual workloads. Tracking key metrics such as duration, resource utilization (CPU allocated, GPU allocated, and memory allocated), usage analysis, and execution trends helps you identify bottlenecks and improve overall efficiency. You can monitor the following workload categories:
  • Jobs: Monitor long-running jobs.
  • Sessions: Monitor long-running sessions.
  • Models: Monitor all active models currently deployed in your workspace. Use the insights gained from monitoring to prioritize which deployments to optimize.
  • Applications: Monitor long-running web applications.

Charts on workload category dashboard

The following lists each chart name and the metrics displayed on that chart.

Chart:
  • Jobs
  • Sessions
  • Models
  • Applications
Metrics:
  • Shows detailed resource consumption by each job, session, model, and application.
  • Filter the top jobs, sessions, models, and applications by selecting one of the following options:
    • Duration (in seconds, minutes, and hours)
    • CPU Allocated (in cores)
    • GPU Allocated (in cores)
    • Memory Allocated (in mebibytes (MiB) and gibibytes (GiB))
  • Hovering over the data displays the name, execution ID, user, and the selected filter category name.

Chart:
  • Job Usage Analysis
  • Session Usage Analysis
  • Model Usage Analysis
  • Application Usage Analysis
Metrics:
  • Usage is categorized into three types: Users, Teams, and Projects.
  • Filter usage by CPU Allocated (in cores), Memory Allocated (in gigabytes), and GPU Allocated (in cores).
  • Hovering over the data shows the selected category name and usage information.
  • Click the usage link to navigate to the ML Workloads page. For more information, see Machine Learning (ML) workload metrics and status details.

Chart:
  • Job Execution Trends
  • Session Usage Trends
  • Model Usage Trends
  • Application Usage Trends
Metrics:
  • Displays the total number of ML workloads and the number of failed workloads within the workspace. The data is displayed based on the selected date filter.
  • Click the ML workload number link to navigate to the ML Workloads page. For more information, see Machine Learning (ML) workload metrics and status details.

Chart:
  • Job Duration
  • Session Duration
  • Model Duration
  • Application Duration
Metrics:
  • Presents the median duration for jobs, sessions, models, and applications, measured in hours, minutes, seconds, and milliseconds.
    • The median value shown at the top indicates the duration for all workloads displayed on the bar chart.
    • Each bar on the chart represents the number of workloads completed within the corresponding time range.
  • Click the median value link to navigate to the ML Workloads page. For more information, see Machine Learning (ML) workload metrics and status details.
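
As a rough illustration of how the top-workload and Duration charts summarize data, the following Python sketch ranks workloads by a selected metric, computes a median duration, and counts workloads per time range. The record fields (name, duration_s, cpu_cores), the time ranges, and the sample values are all assumptions made for this example; they do not reflect the product's actual data model, API, or chart ranges.

```python
from statistics import median

# Hypothetical duration ranges in seconds (label, inclusive low, exclusive high).
# The ranges used by the real dashboard are not documented here.
BUCKETS = [
    ("< 1 min", 0, 60),
    ("1-10 min", 60, 600),
    ("10-60 min", 600, 3600),
    (">= 1 h", 3600, float("inf")),
]


def median_duration(durations_s):
    """Return the median of a list of workload durations in seconds."""
    return median(durations_s)


def count_by_range(durations_s, buckets=BUCKETS):
    """Count how many workloads fall into each [low, high) duration range."""
    counts = {label: 0 for label, _, _ in buckets}
    for d in durations_s:
        for label, low, high in buckets:
            if low <= d < high:
                counts[label] += 1
                break
    return counts


def top_workloads(workloads, metric, n=5):
    """Return the n workloads with the largest value for the chosen metric."""
    return sorted(workloads, key=lambda w: w[metric], reverse=True)[:n]


# Made-up sample records, for illustration only.
sample = [
    {"name": "train-a", "duration_s": 45, "cpu_cores": 2},
    {"name": "train-b", "duration_s": 300, "cpu_cores": 4},
    {"name": "etl-c", "duration_s": 900, "cpu_cores": 1},
    {"name": "eval-d", "duration_s": 5400, "cpu_cores": 8},
    {"name": "eval-e", "duration_s": 120, "cpu_cores": 2},
]

durations = [w["duration_s"] for w in sample]
print(median_duration(durations))
print(count_by_range(durations))
print([w["name"] for w in top_workloads(sample, "cpu_cores", n=2)])
```

Passing a different key as the metric argument, such as "duration_s", mirrors selecting a different filter option on the chart.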