Infrastructure

The Real-time Infrastructure tab displays Kubernetes pod metrics to help you investigate performance issues in your Cloudera AI workbenches. You can use this data to identify CPU spikes, memory quota issues, or failed pods, view the relevant resource usage charts, and pinpoint pods with extreme resource consumption.

Monitoring the infrastructure of your Cloudera AI workbenches is essential for identifying and debugging performance issues at the pod level. The Real-time Infrastructure tab provides a holistic view of the Kubernetes pod metrics associated with your workbenches. You can use this data to investigate failed pods, memory quota issues, or CPU utilization spikes that might affect your Cloudera AI workbench.

You can drill down from a high-level summary of total pods to specific pod metrics across multiple namespaces. By default, the charts display metrics for the five namespaces with the highest pod counts, but you can select up to five specific namespaces to focus your analysis. This filtering ensures that the CPU, memory, and storage charts show only the data relevant to your workbenches.
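
The default namespace selection described above can be pictured with a short Python sketch. This is only an illustration of "five namespaces with the highest pod counts" and "up to five selected namespaces"; the function names and the sample namespace names are hypothetical and are not part of Cloudera AI.

```python
from collections import Counter

MAX_NAMESPACES = 5  # the charts cover at most five namespaces at a time


def default_namespaces(pod_namespaces, limit=MAX_NAMESPACES):
    """Return the namespaces with the highest pod counts, largest first."""
    counts = Counter(pod_namespaces)
    return [namespace for namespace, _ in counts.most_common(limit)]


def selected_namespaces(pod_namespaces, chosen=None, limit=MAX_NAMESPACES):
    """Use an explicit selection when given, else fall back to the default."""
    if not chosen:
        return default_namespaces(pod_namespaces, limit)
    if len(chosen) > limit:
        raise ValueError(f"select at most {limit} namespaces")
    return list(chosen)


# Hypothetical data: one entry per pod, holding that pod's namespace.
pods = (
    ["ws-a"] * 4 + ["ws-b"] * 2 + ["monitoring"] * 7
    + ["ws-c"] * 1 + ["ws-d"] * 3 + ["ws-e"] * 5
)
print(default_namespaces(pods))
```

With six namespaces in the sample data, the namespace with the fewest pods (ws-c) is left out of the default view until you select it explicitly.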

The page includes widgets that highlight the top five and bottom five pods based on their resource usage. For example, you can identify which pods are consuming the most memory or which pods have the lowest network throughput. All data displayed in these charts and tables is fetched at a fixed two-minute interval and covers a rolling 30-minute window of recent activity.
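
As a rough illustration of the ranking and the refresh arithmetic above, the following Python sketch picks the top five and bottom five pods from a usage mapping and works out how many two-minute fetch intervals fit in the 30-minute window. The pod names and values are hypothetical, and the sketch does not reflect how the tab is implemented.

```python
import heapq

FETCH_INTERVAL_MINUTES = 2  # fixed fetch interval
WINDOW_MINUTES = 30         # rolling window shown in the charts

# Number of fetch intervals that fit in the window: 30 / 2 = 15.
INTERVALS_PER_WINDOW = WINDOW_MINUTES // FETCH_INTERVAL_MINUTES


def top_and_bottom(usage_by_pod, n=5):
    """Return (top n, bottom n) pod names ranked by usage."""
    top = heapq.nlargest(n, usage_by_pod, key=usage_by_pod.get)
    bottom = heapq.nsmallest(n, usage_by_pod, key=usage_by_pod.get)
    return top, bottom


# Hypothetical memory usage per pod, in MiB.
memory_mib = {
    "pod-a": 512, "pod-b": 2048, "pod-c": 128, "pod-d": 4096,
    "pod-e": 256, "pod-f": 1024, "pod-g": 64,
}
top, bottom = top_and_bottom(memory_mib)
print("Top 5:", top)
print("Bottom 5:", bottom)
print("Intervals per window:", INTERVALS_PER_WINDOW)
```

When fewer than ten pods are present, as in this sample, the same pod can appear in both lists.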

Infrastructure metrics reference

The following table describes the primary metrics available on the Real-time Infrastructure tab:
Widget Description
Total Pods Displays the total count of running and failed pods in the selected namespace.
Running Pods Displays the total number of pods currently in the Running state within a namespace.
Failed Pods Displays a list of pods that are in the Failed state, along with their namespace and time of failure.
Resource Quotas (CPU and Memory) Shows CPU and memory usage against established limits for the selected namespaces. You can sort the tables in ascending or descending order by usage.
Top 5 and Bottom 5 Displays charts of the pods with the highest and lowest resource consumption. Hover over a data point to see precise values, such as specific cores or gigabytes used.
Network & Storage Displays real-time tracking of IOPS, bytes received and transmitted, packets received and transmitted, and packet error rates. The data is visualized as line charts covering the last 30 minutes.

Column Description
Namespace Namespace within the workbench
Pod Name Name of the pod within a namespace
Usage Displays the actual CPU processing power and memory that the pod currently consumes, measured in cores and in MiB or GiB respectively.

Idle Displays the amount of allocated CPU processing power that the pod is not currently using.
CPU Usage/Requests Displays the percentage of the requested CPU resources that the pod is actively utilizing.
CPU Usage/Limits Displays the percentage of the hard CPU limit that the pod is currently consuming.
Memory Usage/Requests Displays the percentage of the requested memory (Request) that the pod is currently using.
Memory Usage/Limits Displays the percentage of the maximum allowed memory (Limit) that the pod is currently using.
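
The percentage columns above are ratios of usage to the pod's request or limit. The following Python sketch shows one way to read them, assuming that Idle means the requested CPU that is not currently in use; that interpretation, the function names, and the sample values are assumptions made for illustration only.

```python
def percent(used, total):
    """Return used/total as a percentage, or None when total is unset."""
    if not total:
        return None
    return 100.0 * used / total


def cpu_columns(usage_cores, request_cores, limit_cores):
    """Build the CPU columns for one pod from usage, request, and limit."""
    # Assumption: Idle is the requested CPU not in use, floored at zero.
    idle = max(request_cores - usage_cores, 0.0)
    return {
        "Usage": usage_cores,
        "Idle": idle,
        "CPU Usage/Requests": percent(usage_cores, request_cores),
        "CPU Usage/Limits": percent(usage_cores, limit_cores),
    }


# Hypothetical pod: uses 0.5 cores, requests 2 cores, limited to 4 cores.
row = cpu_columns(usage_cores=0.5, request_cores=2.0, limit_cores=4.0)
print(row)
```

In this example the pod uses 25% of its CPU request and 12.5% of its CPU limit. A Usage/Requests value above 100% means the pod is consuming more than it requested, which is possible as long as it stays under its limit. The memory columns follow the same ratios with MiB or GiB in place of cores.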

Widgets

Widget Name Description
CPU Utilization Displays the amount of processing power consumed by specific pods, represented in cores.
Memory Utilization Displays the amount of memory consumed by specific pods, represented in MiB or GiB.
Storage IOPS (Reads + Writes) Displays the combined number of read and write operations per second (Input/Output Operations Per Second) to indicate storage activity, measured in operations per second.
Storage IO (Reads + Writes) Displays the throughput of data being read from or written to storage, measured in bytes per second.
Bytes Received or Transmitted Displays the volume of network traffic entering and leaving the pods over time.
Rate of Received or Transmitted Packets Displays the frequency of network packets being sent or received, which helps identify network throughput patterns.
Packet Errors Displays the number of failed network transmissions. A value above zero typically indicates a network issue.
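
To make the difference between the storage and network rate widgets concrete, the following Python sketch derives per-second rates from two readings of a cumulative counter. It assumes that the underlying metrics are ever-increasing counters sampled at the fetch interval, which is a common pattern for container metrics but is not confirmed here for Cloudera AI; all names and values are hypothetical.

```python
def rate_per_second(previous, current, elapsed_seconds):
    """Per-second rate between two readings of a cumulative counter."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    # A counter that went backwards was probably reset (for example after
    # a pod restart), so count from zero instead of reporting a negative rate.
    delta = current - previous if current >= previous else current
    return delta / elapsed_seconds


ELAPSED = 120  # seconds between two fetches (two-minute interval)

# Storage IOPS counts operations; Storage IO counts bytes.
iops = (
    rate_per_second(1_000, 7_000, ELAPSED)    # read operations
    + rate_per_second(500, 3_500, ELAPSED)    # write operations
)
throughput = (
    rate_per_second(0, 12_000_000, ELAPSED)   # bytes read
    + rate_per_second(0, 6_000_000, ELAPSED)  # bytes written
)
print(f"Storage IOPS: {iops} operations/s")
print(f"Storage IO: {throughput} bytes/s")
```

The same calculation applies to bytes, packets, and packet errors received or transmitted: only the unit of the counter changes.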