Summary of all ML workspaces in Cloudera Observability

You can monitor a summary of all workspaces and track usage trends for both workloads and infrastructure in Cloudera Observability. This view helps you decide which workspace to investigate further by identifying potential issues based on cluster activity, peak and low times, deviations, and other indicators.

How to select a time range

By default, the ML summary dashboard displays workload data for the last 24 hours. You can select a different time range from the time range list. All charts and tables on the dashboard are then updated to reflect the workload data for the selected period. For information about the time range list options, see Specifying a time range.

How the workload performance metrics across all workspaces are represented

The Machine Learning summary page provides detailed information for all workspaces, listing multiple workspaces running under the specific machine learning service.

  • TOTAL ML WORKLOADS: A bar chart shows the total count of ML workloads across all workspaces in the Cloudera AI environment, categorized by jobs, sessions, applications, and models. Each category is shown as a horizontal bar whose value is summed across the workspaces. The category with the highest value appears at the top.
  • FAILED ML WORKLOADS: A bar chart shows the total ML workloads together with the failed ones. The jobs, sessions, applications, and models categories each show their count of failed ML workloads. The category with the highest count of failed workloads appears at the top.
  • AVERAGE SYSTEM BUSYNESS: A line chart displays the average consumption of resources across all your workspaces at the Cloudera AI level, expressed as a percentage. This metric helps you understand concurrent processes, CPU utilization, memory usage, network traffic, and storage access, and indicates whether you over-allocate or under-allocate resources to your workspaces. Hover over the chart to view the average percentage of system busyness.
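The aggregation behind the TOTAL ML WORKLOADS chart can be pictured with a minimal sketch. It assumes per-workspace counts are available as plain dictionaries; the workspace names, the numbers, and the `total_by_category` helper are hypothetical and are not part of Cloudera Observability.

```python
from collections import Counter

# Hypothetical per-workspace workload counts (illustrative values only).
workspace_counts = {
    "workspace-a": {"jobs": 12, "sessions": 30, "applications": 4, "models": 2},
    "workspace-b": {"jobs": 7, "sessions": 11, "applications": 9, "models": 5},
}


def total_by_category(counts_per_workspace):
    """Sum workload counts per category across workspaces, highest first."""
    totals = Counter()
    for counts in counts_per_workspace.values():
        totals.update(counts)  # adds each category's count to the running total
    # most_common() returns (category, count) pairs sorted by count, descending.
    return totals.most_common()


ranked_totals = total_by_category(workspace_counts)
for category, count in ranked_totals:
    print(f"{category}: {count}")
```

The same summing and sorting would apply to failed workloads if the input dictionaries held failure counts instead of totals.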

How to analyze workspace usage

The Workspace Usage Analysis page lists active workspaces based on their average busyness metrics. The page opens with the following options:
  • Top 5 Allocated: Displays the top five allocated workspaces, calculated based on the busyness usage, not the percentage. If the number of active workspaces is fewer than five based on the busyness metrics, only those workspaces are listed.
  • Bottom 5 Allocated: Displays the bottom five workspaces, calculated based on the busyness usage, not the percentage.
  • Custom 5: Allows you to choose which workspaces to display. Select up to five workspaces and click Apply.
From the workspace list, in the Name column, click a workspace link to view the detailed usage analysis for that workspace. See Resource utilization and performance metrics for Machine Learning (ML) workspace.
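The Top 5 and Bottom 5 options amount to ranking workspaces by a busyness value and taking the first five from either end. The following sketch illustrates that idea under stated assumptions: the `busyness` mapping, its values, and the `top_n` and `bottom_n` helpers are hypothetical, and the product's actual calculation is not documented here.

```python
# Hypothetical busyness usage per active workspace (absolute values, not percentages).
busyness = {
    "ws-1": 42.0,
    "ws-2": 7.5,
    "ws-3": 18.0,
    "ws-4": 63.2,
    "ws-5": 3.1,
    "ws-6": 25.4,
}


def top_n(busyness_by_workspace, n=5):
    """Return up to n workspace names with the highest busyness usage."""
    ranked = sorted(busyness_by_workspace.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:n]]


def bottom_n(busyness_by_workspace, n=5):
    """Return up to n workspace names with the lowest busyness usage."""
    ranked = sorted(busyness_by_workspace.items(), key=lambda kv: kv[1])
    return [name for name, _ in ranked[:n]]


print(top_n(busyness))
print(bottom_n(busyness))
# With fewer than five active workspaces, only those workspaces are returned.
print(top_n({"ws-1": 42.0, "ws-2": 7.5}))
```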

Resource utilization across workspaces

  • CPU: Provides a historical overview of CPU usage at the workspace level. Hover over the chart to see CPU usage as a percentage and the actual CPU utilized compared to the available CPU.
  • GPU: Provides a historical overview of GPU usage at the workspace level. Hover over the chart to see how many GPU cores are used compared to the allocated GPU cores.
  • Memory: Provides a historical overview of memory usage at the workspace level. Hover over the chart to view memory usage as a percentage and the actual memory used compared to the available memory.
  • GPU Memory: Provides a historical overview of GPU memory usage at the workspace level. Hover over the chart to view GPU memory usage in bytes and the actual GPU memory used compared to the allocated GPU memory.
  • Network: Provides a historical overview of network activity, measured in mebibytes (MiB) and gibibytes (GiB). Hover over the chart to view, in list form, the number of bytes received and the number of bytes transmitted by all workspaces.
  • Storage: Displays IOPS and throughput. Hover over the chart to view information on reads and writes in bytes per second.
    • IOPS: The IOPS metric shows how many read and write operations a storage device can perform per second. A single operation is performed on one block. A Hard Disk Drive (HDD) normally has 512 B or 4 KB blocks, whereas modern Solid State Drives (SSDs) expose storage memory in pages joined in blocks that can reach 512 KB in size.
    • Throughput: Storage throughput (data transfer rate) measures the amount of data transferred to and from the storage device per second. Throughput is normally measured in megabytes per second (MB/s) and is closely related to IOPS and block size.
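Two of the relationships above lend themselves to a short worked example: usage as a percentage of the available resource, and the link between throughput, IOPS, and block size. This is a rough sketch rather than the product's calculation. It assumes the idealized approximation throughput ≈ IOPS × block size, treats 4 KB as 4096 bytes, and uses hypothetical function names and sample values.

```python
MIB = 1024 ** 2  # bytes in one mebibyte


def utilization_percent(used, available):
    """Usage as a percentage of the available (or allocated) resource."""
    if available <= 0:
        raise ValueError("available must be positive")
    return 100.0 * used / available


def throughput_bytes_per_second(iops, block_size_bytes):
    """Idealized throughput: operations per second times bytes per operation."""
    return iops * block_size_bytes


# Assumed sample values: 3 CPU cores used out of 12 available.
cpu_pct = utilization_percent(3, 12)

# Assumed sample values: 2000 IOPS with 4 KB blocks (taken here as 4096 bytes).
tput = throughput_bytes_per_second(2000, 4096)

print(f"CPU usage: {cpu_pct:.1f}%")
print(f"Throughput: {tput} B/s ({tput / MIB:.4f} MiB/s)")
```

For a fixed IOPS figure, a larger block size raises throughput proportionally, which is why the two storage metrics are best read together.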