Infrastructure metrics

From the Cloudera Observability Real-time Monitoring (RTM) Infrastructure page, you can monitor the infrastructure of a virtual machine-based system in real time. You can monitor the infrastructure of a Cloudera Data Hub and Cloudera on premises at the node level to gain insights into the capacity and usage of infrastructure components such as CPU, memory, storage, node count, and network.

Monitoring resource utilization at the node level is essential to ensure optimal performance and availability. From the Infrastructure tab, you can visualize and analyze the metrics and understand whether any backend issue impacts users. This allows you to detect system disruptions or suspicious activities and fix them before they become critical.

Observe infrastructure metrics from the Overview view

The Overview view displays the overall aggregated metrics of the Cloudera Data Hub and Cloudera on premises cluster. As an administrator, you can monitor the statistics with the following options:

Total Memory: Displays the total memory used by the Cloudera Data Hub and Cloudera on premises Cluster. The units shown vary according to the actual memory value. For example, the units can be B, KiB, MiB, GiB, or TiB.
Total CPU: Displays the total number of CPU cores used by the Cloudera Data Hub and Cloudera on premises Cluster.
CPU Usage: The CPU usage chart displays the percentage of CPU actively used by the Cloudera Data Hub and Cloudera on premises Cluster relative to the total CPU allocated to it.
Memory Usage: The Memory usage chart displays the percentage of memory used by the Cloudera Data Hub and Cloudera on premises Cluster relative to the total memory allocated to it.
Storage Usage: The Storage usage chart displays the percentage of storage space used by the Cloudera Data Hub and Cloudera on premises Cluster relative to the total storage allocated to it.
NODE COUNTS: Displays the total number of nodes the Cloudera Data Hub and Cloudera on premises Cluster uses.
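
The usage percentages and the automatic unit scaling described above can be illustrated with a minimal Python sketch. The helper names format_bytes and usage_percent are hypothetical and are not part of Cloudera Observability; the sketch only assumes that a usage percentage is the used amount divided by the allocated amount, and that byte values are scaled in steps of 1024.

```python
# Illustrative helpers only; they are not part of Cloudera Observability.
UNITS = ["B", "KiB", "MiB", "GiB", "TiB"]


def format_bytes(value: float) -> str:
    """Scale a byte count to the largest binary unit listed in UNITS."""
    unit_index = 0
    while value >= 1024 and unit_index < len(UNITS) - 1:
        value /= 1024
        unit_index += 1
    return f"{value:.1f} {UNITS[unit_index]}"


def usage_percent(used: float, allocated: float) -> float:
    """Return used as a percentage of allocated (0.0 if nothing is allocated)."""
    if allocated <= 0:
        return 0.0
    return 100.0 * used / allocated
```

For example, under these assumptions a cluster that uses 48 of 64 allocated CPU cores has a CPU usage of 75 percent, and a memory value of 1536 bytes is shown as 1.5 KiB.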

Monitor node status

On the Node Busyness chart, you can monitor the node utilization of the Cloudera Data Hub and Cloudera on premises Cluster to help keep the cluster healthy and stable. The chart provides a historical view of node activity, represented as a percentage. This metric helps you understand node utilization and associated costs, calculated using the Cloudera Compute Unit (CCU) measurement, which combines CPU and memory usage. The value helps you determine which nodes are over-provisioned and which ones require scaling up.

Node Busyness includes the following filters:

Top 5 Busiest: Displays the five busiest nodes of the Cloudera Data Hub and Cloudera on premises Cluster, ranked by busyness.
Bottom 5 Busiest: Displays the five least busy nodes of the Cloudera Data Hub and Cloudera on premises Cluster, ranked by busyness.
Custom: Displays the nodes of the Cloudera Data Hub and Cloudera on premises Cluster that you select. Click the drop-down arrow and select the nodes for which you want to see the data.
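
The Top 5 and Bottom 5 filters amount to ranking nodes by their busyness percentage. The following Python sketch shows one way to express that ranking. The node names and values are invented for illustration, the function names are hypothetical, and the sketch does not reproduce how the busyness value itself is calculated.

```python
# Illustrative only: node names and busyness values are made up.
def busiest(busyness: dict[str, float], count: int = 5) -> list[str]:
    """Return up to `count` node names, most busy first."""
    return sorted(busyness, key=lambda node: busyness[node], reverse=True)[:count]


def least_busy(busyness: dict[str, float], count: int = 5) -> list[str]:
    """Return up to `count` node names, least busy first."""
    return sorted(busyness, key=lambda node: busyness[node])[:count]


sample = {
    "node-a": 91.0, "node-b": 12.5, "node-c": 64.0, "node-d": 47.0,
    "node-e": 8.0, "node-f": 73.5, "node-g": 30.0,
}
top_five = busiest(sample)
bottom_five = least_busy(sample)
```

In this sample, the nodes at the top of the first list are candidates for scaling up, and the nodes at the top of the second list might be over-provisioned.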

Monitor CPU and memory usage metrics by node

If you encounter performance issues and suspect CPU or memory usage as the cause, you can monitor usage by nodes.

Monitoring the CPU allows you to track its capacity, usage, and availability. To monitor CPU usage by node, examine the metrics on the following graphs:

CPU % UTILIZATION PER NODE: Displays the percentage of CPU each node uses.
CPU LOAD AVG: Displays the average CPU load for each node. CPU load refers to the number of processes that are either currently being executed by the CPU or are waiting for execution. An idle system has a load of 0. For every process either running or waiting, the load increases by 1.
CPU USAGE PER NODE: Displays the total number of CPU cores used by each node.
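
Because load counts running and waiting processes, a load value is easier to interpret when it is compared with the number of CPU cores on the node. The Python sketch below shows that comparison under the common assumption that a load per core above 1.0 suggests processes are waiting for CPU time. The function names are hypothetical. The local reading uses the standard library calls os.getloadavg, which is available on Unix-like systems only, and os.cpu_count; it is not how Cloudera Observability collects its metrics.

```python
import os


def load_per_core(load_average: float, cores: int) -> float:
    """Divide a load average by the number of CPU cores."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return load_average / cores


def local_load_per_core() -> float:
    """Read the 1-minute load average of this machine (Unix-like systems only)."""
    one_minute, _five_minute, _fifteen_minute = os.getloadavg()
    cores = os.cpu_count() or 1  # cpu_count() can return None
    return load_per_core(one_minute, cores)
```

For example, a load of 8 on a node with 4 cores gives a load per core of 2.0, which under this assumption means that on average about as many processes are waiting as are running.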

Monitoring memory allows you to track memory capacity, usage, and availability. To monitor memory usage by node, examine the metrics on the following graphs:

NODE MEMORY USAGE %: Displays the percentage of memory used by each node. It is the ratio of the memory used by the node to the total memory allocated to the node.
MEMORY USAGE PER NODE: Displays the amount of memory used by each node. The units shown vary according to the actual memory usage value. For example, the units can be B, KiB, MiB, GiB, or TiB.
SWAP TOTAL, SWAP USED: Displays the percentage of swap (partition) space utilized by each node. It is the ratio of the swap space used to the total swap space allocated to the node. When you hover over a line in the chart, it also displays the total swap space used.
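
The memory and swap percentages above are ratios of used space to total space. As a rough illustration, the Python sketch below derives both ratios from text in the Linux /proc/meminfo format. The sample values are invented, the function names are hypothetical, and treating used memory as MemTotal minus MemAvailable is an assumption of this sketch, not a statement of how Cloudera Observability computes the charted values.

```python
def parse_meminfo(text: str) -> dict[str, int]:
    """Parse '/proc/meminfo'-style lines into a mapping of field name to value."""
    fields: dict[str, int] = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])
    return fields


def percent(used: float, total: float) -> float:
    """Return used as a percentage of total (0.0 if total is not positive)."""
    return 100.0 * used / total if total > 0 else 0.0


# Invented sample values, in the units reported by /proc/meminfo.
SAMPLE = """MemTotal:       16000000 kB
MemAvailable:    4000000 kB
SwapTotal:       2000000 kB
SwapFree:        1500000 kB"""

fields = parse_meminfo(SAMPLE)
memory_used_pct = percent(fields["MemTotal"] - fields["MemAvailable"], fields["MemTotal"])
swap_used_pct = percent(fields["SwapTotal"] - fields["SwapFree"], fields["SwapTotal"])
```

With the sample values, the node uses 75 percent of its memory and 25 percent of its swap space.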

Monitor storage usage on your infrastructure

Monitoring storage allows you to observe, analyze, and manage the storage systems within the cloud infrastructure. Administrators can track the availability, performance, operational status, and health of cloud servers and components. Storage monitoring helps identify and address storage-related issues, disruptions, vulnerabilities, or suspicious activities before they escalate or become critical.

DISK IOPS: Allows you to monitor a storage device's responsiveness to requests by tracking its throughput and IOPS.

Throughput: The amount of data read from or written to the storage device per second. Monitoring throughput helps ensure that a storage device is not regularly running at its maximum throughput rate. The units shown vary according to the actual throughput values. For example, the units can be KiB/s, MiB/s, or GiB/s.
IOPS: The number of read and write operations that a storage device is performing successfully per second. The actual IOPS value must be close to the defined IOPS value for each device for seamless performance. Measured in io/s.
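
Both throughput and IOPS are per-second rates, so they can be derived from two readings of cumulative counters taken some time apart. The Python sketch below shows that calculation. The DiskSample type and its field names are hypothetical and do not correspond to any Cloudera Observability API.

```python
from dataclasses import dataclass


@dataclass
class DiskSample:
    """Hypothetical snapshot of cumulative disk counters."""
    timestamp: float   # seconds since an arbitrary reference point
    operations: int    # completed read and write operations so far
    bytes_moved: int   # bytes read and written so far


def disk_rates(earlier: DiskSample, later: DiskSample) -> tuple[float, float]:
    """Return (IOPS in io/s, throughput in bytes per second) between two samples."""
    elapsed = later.timestamp - earlier.timestamp
    if elapsed <= 0:
        raise ValueError("later sample must be newer than earlier sample")
    iops = (later.operations - earlier.operations) / elapsed
    throughput = (later.bytes_moved - earlier.bytes_moved) / elapsed
    return iops, throughput


first = DiskSample(timestamp=0.0, operations=1000, bytes_moved=0)
second = DiskSample(timestamp=10.0, operations=6000, bytes_moved=50 * 1024 * 1024)
iops, throughput = disk_rates(first, second)
```

In this example, 5000 operations and 50 MiB of data over 10 seconds correspond to 500 io/s and a throughput of 5 MiB/s.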

Monitor network usage

Monitoring the network usage of the nodes in the Cloudera Data Hub and Cloudera on premises cluster allows you to identify when and where the network is having issues. Early troubleshooting helps prevent network failures and supports business continuity. Network monitoring also enables you to detect cloud infrastructure problems, analyze application interactions and traffic flow, and identify problems sooner. This can increase efficiency and flexibility, help control costs, improve the utilization of IT resources and personnel, and provide access to historical network data for analytics.

BYTES RECEIVED: The total volume of data received by the network interface during processing for each node. The units shown vary according to the actual volume of data received. For example, the units can be B, KiB, MiB, GiB, or TiB.
BYTES TRANSMITTED: The total volume of data transmitted by the network interface during processing for each node. The units shown vary according to the actual volume of data transmitted. For example, the units can be B, KiB, MiB, GiB, or TiB.
PACKETS RECEIVED: The total number of packets of data received by the network interface for each node.
PACKETS TRANSMITTED: The total number of packets of data transmitted by the network interface for each node.
DROPPED PACKETS: The total number of packets of data that failed to reach their destination per node.
ERRORS: The total number of packets that are damaged or have format errors per node.
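
One way to read the packet counters together is to express dropped and errored packets as a percentage of all packets a node handled, and then list the nodes that exceed a chosen threshold. The Python sketch below does this. The NodeNetwork type, the node names, the counter values, and the 1 percent threshold are all assumptions made for illustration; they are not defined by Cloudera Observability.

```python
from dataclasses import dataclass


@dataclass
class NodeNetwork:
    """Hypothetical per-node packet counters."""
    packets_received: int
    packets_transmitted: int
    dropped: int
    errors: int


def problem_percent(stats: NodeNetwork) -> float:
    """Dropped plus errored packets as a percentage of packets handled."""
    handled = stats.packets_received + stats.packets_transmitted
    if handled == 0:
        return 0.0
    return 100.0 * (stats.dropped + stats.errors) / handled


def nodes_over_threshold(nodes: dict[str, NodeNetwork], threshold_percent: float) -> list[str]:
    """Return the names of nodes whose problem percentage exceeds the threshold."""
    return [name for name, stats in nodes.items()
            if problem_percent(stats) > threshold_percent]


sample_nodes = {
    "node-1": NodeNetwork(1000, 1000, dropped=2, errors=0),
    "node-2": NodeNetwork(500, 500, dropped=40, errors=10),
}
flagged = nodes_over_threshold(sample_nodes, threshold_percent=1.0)
```

With these sample counters, node-1 has a problem percentage of 0.1 and node-2 has 5.0, so only node-2 is flagged.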