Resource utilization and performance metrics for Machine Learning (ML) workspace
Understand how to access the ML workspace and explore detailed information on ML workloads, including infrastructure analysis for nodes, pods, namespaces, and more.
How to access the ML workspace
- Navigate to the Machine Learning data service, and select the ML workspace from the workspace list.
- Navigate to the ML summary page. In the Workspace Usage Analysis section, from the active workspaces list, click the workspace name link.
How to select a time range
By default, the ML workspace dashboard displays workload data for the last 24 hours. You can select a different period from the time-range list. All charts and tables on the dashboard are then updated to reflect the workload data for the selected period. For information about the time-range list options, see Specifying a time range.
How the workload performance metrics within the selected workspace are represented
- TOTAL ML WORKLOADS: A bar chart shows the total number of ML workloads in the selected workspace, categorized by jobs, sessions, applications, and models. Each category is shown as a horizontal bar with its summed count, and the category with the highest count appears at the top.
- FAILED ML WORKLOADS: A bar chart shows the number of failed ML workloads alongside the total number of ML workloads in the selected workspace, categorized by jobs, sessions, applications, and models. Each category is shown as a horizontal bar with its summed count, and the category with the highest number of failed workloads appears at the top.
- AVERAGE SYSTEM BUSYNESS: A line chart displays the average resource consumption of the selected workspace as a percentage. This metric helps you determine whether resources are over-allocated or under-allocated to the selected workspace.
- ML Workloads Execution Trends: A trend chart shows total and failed workloads over the same time range as the average system busyness chart.
- Usage Analysis: A bar chart displays an analysis of individual workloads, projects, users, and teams within the workspace based on busyness metrics.
  - ML workloads: Lists the top 25 workloads.
  - Projects: Lists the projects to which the workloads belong.
  - Users: Lists the users running workloads, based on namespaces.
  - Teams: Lists the team if the workload is part of a project that belongs to a team.
To identify the workloads, projects, users, and teams with higher CPU, GPU, and memory usage, you can filter them by the following categories: CPU Allocated, GPU Allocated, and Memory Allocated.
These insights help administrators understand which users and projects consume the most resources and system time, so that they can monitor and manage projects accordingly.
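The ranking and filtering described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how the dashboard is implemented: the record layout, the field names (`cpu_allocated`, `memory_allocated_gb`), and the helper functions `top_workloads` and `totals_by` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical workload records; the field names are illustrative only
# and do not reflect an actual product schema.
WORKLOADS = [
    {"name": "train-fraud", "project": "fraud", "user": "ana",
     "cpu_allocated": 8.0, "memory_allocated_gb": 32.0},
    {"name": "score-fraud", "project": "fraud", "user": "ben",
     "cpu_allocated": 2.0, "memory_allocated_gb": 8.0},
    {"name": "notebook-1", "project": "churn", "user": "ana",
     "cpu_allocated": 4.0, "memory_allocated_gb": 16.0},
]


def top_workloads(records, metric, limit=25):
    """Return up to `limit` workloads, highest value of `metric` first."""
    ranked = sorted(records, key=lambda r: r[metric], reverse=True)
    return ranked[:limit]


def totals_by(records, group_key, metric):
    """Sum `metric` per project, user, or team, largest total first."""
    totals = defaultdict(float)
    for record in records:
        totals[record[group_key]] += record[metric]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

With the sample data, `totals_by(WORKLOADS, "user", "cpu_allocated")` puts `ana` first with 12.0 allocated CPUs, which is the kind of signal an administrator would look for when deciding which users or projects to review.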
How to analyze resource usage by nodes
- Total: Shows aggregated usage across the nodes, depicting total used and allocated values, with scale-up and scale-down actions indicated.
- Top 5 busiest: Displays the five nodes that consumed the most resources according to the busyness metrics. The ranking is based on actual usage values, not percentages.
- Bottom 5 busiest: Displays the five nodes that consumed the fewest resources according to the busyness metrics. The ranking is based on actual usage values, not percentages.
- Custom 5: Allows you to view a custom selection of nodes. Select up to five nodes and click Apply.
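Ranking by actual usage rather than by percentage can change which nodes appear in the list. The following sketch illustrates the difference under assumed data: the node records, the `cpu_used` and `cpu_capacity` fields, and the functions `rank_nodes` and `custom_nodes` are hypothetical and are not part of the product.

```python
# Hypothetical node records; field names are illustrative only.
NODES = [
    {"name": "node-a", "cpu_used": 12.0, "cpu_capacity": 64.0},  # 18.75% used
    {"name": "node-b", "cpu_used": 6.0, "cpu_capacity": 8.0},    # 75% used
    {"name": "node-c", "cpu_used": 20.0, "cpu_capacity": 32.0},  # 62.5% used
]


def rank_nodes(nodes, count=5, busiest=True, by_percentage=False):
    """Rank nodes by absolute usage (default) or by percentage used."""
    def score(node):
        if by_percentage:
            return node["cpu_used"] / node["cpu_capacity"]
        return node["cpu_used"]

    ranked = sorted(nodes, key=score, reverse=busiest)
    return [node["name"] for node in ranked[:count]]


def custom_nodes(nodes, selected_names, limit=5):
    """Return the selected nodes, rejecting selections above the limit."""
    if len(selected_names) > limit:
        raise ValueError(f"select at most {limit} nodes")
    wanted = set(selected_names)
    return [node["name"] for node in nodes if node["name"] in wanted]
```

In this sample, `rank_nodes(NODES, count=2)` returns `node-c` and `node-a`, while the same call with `by_percentage=True` returns `node-b` and `node-c`. A small node that is nearly full ranks high by percentage but low by actual usage.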
Resource utilization within the workspace
Evaluate how effectively computational resources are utilized within the workspace.
- CPU: Provides a historical view of CPU usage for the individual workspace. Hover over the chart to view CPU usage as a percentage and the CPU used compared to the available CPU.
- Memory: Provides a historical view of memory usage within the workspace. Hover over the chart to view memory usage as a percentage and the memory used compared to the available memory.
- Network: Provides a historical view of network activity, indicating the bytes received and transmitted over the network interface during processing. Hover over the chart to view a list of the number of bytes received and the number of bytes transmitted by the selected workspace.
- Storage: Displays IOPS and throughput. Hover over the chart to view information on reads and writes in bytes per second.
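As a rough guide to reading the hover values, the sketch below shows how a usage percentage relates to used and available amounts, and how a byte count over an interval relates to a rate in bytes per second. It assumes the percentage is simply used divided by available; the function names `usage_percent` and `bytes_per_second` are hypothetical, and the dashboard may compute its values differently.

```python
def usage_percent(used, available):
    """Express `used` as a percentage of `available` (assumed formula)."""
    if available <= 0:
        raise ValueError("available must be positive")
    return 100.0 * used / available


def bytes_per_second(total_bytes, interval_seconds):
    """Average rate for `total_bytes` moved during `interval_seconds`."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return total_bytes / interval_seconds
```

For example, 6 CPUs used out of 8 available corresponds to `usage_percent(6, 8)`, which is 75.0, and 1,048,576 bytes read over 4 seconds corresponds to 262,144 bytes per second.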