Infrastructure
The Real-time Infrastructure tab displays Kubernetes pod metrics to help you investigate performance issues in your Cloudera AI workbenches. You can use this data to identify CPU spikes, memory quota issues, or failed pods, view relevant resource usage charts, and pinpoint pods with extreme resource consumption.
Monitoring the infrastructure of your Cloudera AI workbenches is essential for identifying and debugging performance issues at the pod level. The Real-time Infrastructure tab provides a holistic view of the Kubernetes pod metrics associated with your workbenches. You can use this data to investigate failed pods, memory quota issues, or CPU utilization spikes that might affect your Cloudera AI workbench.
The interface lets you drill down from a high-level summary of total pods to specific pod metrics across multiple namespaces. By default, the charts display metrics for the five namespaces with the highest pod counts, but you can instead select up to five specific namespaces to focus your analysis. This filtering keeps the displayed CPU, memory, and storage charts relevant to your specific workbenches.
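
The default selection described above can be illustrated with a minimal Python sketch. It assumes you already have one namespace name per pod in a list; the function name `default_namespaces` and the alphabetical tie-breaking rule are hypothetical and are not taken from the product.

```python
from collections import Counter


def default_namespaces(pod_namespaces, limit=5):
    """Return up to `limit` namespaces ordered by descending pod count.

    `pod_namespaces` holds one namespace name per pod.
    Tie-breaking by name is an assumption made for this sketch.
    """
    counts = Counter(pod_namespaces)
    # Sort by pod count (highest first), then by name for a stable order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:limit]]


if __name__ == "__main__":
    # Hypothetical namespaces, one entry per pod.
    pods = ["ns-a"] * 4 + ["ns-b"] * 2 + ["ns-c"] * 7
    print(default_namespaces(pods))
```

Selecting namespaces manually in the tab replaces this default ranking with your own list of up to five namespaces.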
Infrastructure metrics reference
| Widgets | Description |
|---|---|
| Total Pods | Displays the total count of running and failed pods in the selected namespace. |
| Running Pods | The total number of pods currently in the Running state within a namespace. |
| Failed Pods | Displays a list of pods marked as Failed, along with their namespace and time of failure. |
| Resource Quotas (CPU and Memory) | Shows CPU and memory usage against established limits for the selected namespaces. You can sort the tables in ascending or descending order by usage. |
| Top 5 and Bottom 5 | Charts depicting the highest and lowest resource-consuming pods. Hover over a data point to see precise values, such as specific cores or gigabytes used. |
| Network & Storage | Real-time tracking of IOPS, bytes received and transmitted, packets received and transmitted, and packet error rates. Visualized as line charts representing data over the last 30 minutes. |

| Column | Description |
|---|---|
| Namespace | Namespace within the workbench. |
| Pod Name | Name of the pod within a namespace. |
| Usage | Displays the actual CPU processing power and memory usage the pod currently consumes, measured in cores and MiB or GiB respectively. |
| Idle | Displays the amount of allocated CPU processing power that the pod is not currently using. |
| CPU Usage/Requests | Displays the percentage of the requested CPU resources that the pod is actively utilizing. |
| CPU Usage/Limits | Displays the percentage of the hard CPU limit that the pod is currently consuming. |
| Memory Usage/Requests | Displays the percentage of the requested memory (Request) that the pod is currently using. |
| Memory Usage/Limits | Displays the percentage of the maximum allowed memory (Limit) that the pod is currently using. |
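
The ratio columns above are simple percentages of usage against the request or the limit. The following Python sketch shows the arithmetic for a single resource. The function name `quota_row` is hypothetical, and treating Idle as the request minus the current usage is an assumption about what "allocated" means here, not a documented formula.

```python
def quota_row(usage, request, limit):
    """Derive Idle, Usage/Requests, and Usage/Limits for one resource.

    All three inputs use the same unit (for example CPU cores or MiB).
    `request` or `limit` may be None when the pod does not define one.
    """

    def percent(part, whole):
        # No percentage can be computed without a positive denominator.
        if not whole:
            return None
        return round(100 * part / whole, 1)

    # Assumption: idle capacity is the requested amount that is not in use.
    idle = None if request is None else max(request - usage, 0.0)
    return {
        "idle": idle,
        "usage_vs_request_pct": percent(usage, request),
        "usage_vs_limit_pct": percent(usage, limit),
    }


if __name__ == "__main__":
    # A pod using 0.5 cores with a 1-core request and a 2-core limit.
    print(quota_row(usage=0.5, request=1.0, limit=2.0))
```

A Usage/Requests value above 100 means the pod consumes more than it requested, while a Usage/Limits value close to 100 means the pod is approaching its hard cap.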
Widgets
| Widget Name | Description |
|---|---|
| CPU Utilization | Displays the amount of processing power consumed by specific pods, represented in cores. |
| Memory Utilization | Displays the amount of memory consumed by specific pods, represented in MiB or GiB. |
| Storage IOPS (Reads + Writes) | Displays the number of read and write operations per second (Input/Output Operations Per Second) to indicate storage activity, measured in operations per second. |
| Storage IO (Reads + Writes) | Displays the throughput of data being read from or written to storage, measured in bytes per second. |
| Bytes Received or Transmitted | Displays the volume of network traffic entering and leaving the pods over time. |
| Rate of Received or Transmitted | Displays the frequency of network packets being sent or received, which helps identify network throughput patterns. |
| Packet Errors | Displays the number of failed network transmissions. A value above zero typically indicates a network issue. |
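
To make the difference between Storage IOPS (operations per second) and Storage IO (bytes per second) concrete, the sketch below derives both rates from two samples of cumulative counters. That the charts are built from cumulative counters is an assumption of this example, and the `StorageSample` type and `storage_rates` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class StorageSample:
    timestamp_s: float  # time the sample was taken, in seconds
    total_ops: int      # reads + writes completed so far
    total_bytes: int    # bytes read + written so far


def storage_rates(prev, curr):
    """Return (iops, bytes_per_second) between two samples."""
    elapsed = curr.timestamp_s - prev.timestamp_s
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    iops = (curr.total_ops - prev.total_ops) / elapsed
    throughput = (curr.total_bytes - prev.total_bytes) / elapsed
    return iops, throughput


if __name__ == "__main__":
    # Two hypothetical samples taken 10 seconds apart.
    earlier = StorageSample(timestamp_s=0.0, total_ops=1000, total_bytes=4_000_000)
    later = StorageSample(timestamp_s=10.0, total_ops=1500, total_bytes=24_000_000)
    print(storage_rates(earlier, later))
```

A high IOPS value with low throughput suggests many small operations, while the opposite pattern suggests fewer, larger transfers.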
