cdp-doctor system metrics

Scope

The cdp-doctor system metrics command provides a snapshot of system-level resource utilization on a CDP node. It reports disk usage, CPU time distribution, and network connection states so that you can confirm the node is healthy and not resource-constrained.

This command is often used during diagnostic checks, performance validation, and capacity monitoring on Data Lake, Data Hub, and FreeIPA nodes.

Disk – Partitions
- Lists each mounted file system with its device, file system type, total, used, and free space, and utilization percentage.
- Helps identify storage bottlenecks or partitions nearing capacity.
Disk – Top Largest Folders in /var/log
- Displays the largest directories under /var/log to identify which services generate the most logs.
- Useful for troubleshooting log-related disk usage issues.
Network – Connections
- Counts TCP connections by state, such as LISTEN, ESTABLISHED, TIMEWAIT, and CLOSEWAIT.
- Helps assess network load and connection health.
CPU – Times
- Shows the percentage of CPU time spent in each mode (idle, system, user, nice).
- Useful for understanding overall system load and performance.
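
The "Top Largest Folders in /var/log" section can be approximated by hand when cdp-doctor is not available. The following Python sketch is an illustration only and is not how cdp-doctor computes its figures; the function names dir_size and largest_subdirs are hypothetical. It sums file sizes beneath each immediate subdirectory and returns the largest ones.

```python
import os


def dir_size(path):
    """Return the total size in bytes of the files found under path."""
    total = 0
    # onerror swallows unreadable directories so one bad folder does not abort the walk.
    for root, _dirs, files in os.walk(path, onerror=lambda err: None):
        for name in files:
            full = os.path.join(root, name)
            try:
                if not os.path.islink(full):
                    total += os.path.getsize(full)
            except OSError:
                # The file vanished or is unreadable; skip it.
                continue
    return total


def largest_subdirs(parent, top_n=10):
    """Return (path, size_bytes) pairs for the top_n largest immediate subdirectories."""
    sizes = []
    with os.scandir(parent) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                sizes.append((entry.path, dir_size(entry.path)))
    sizes.sort(key=lambda item: item[1], reverse=True)
    return sizes[:top_n]
```

Calling largest_subdirs("/var/log") returns up to ten pairs, largest first. Sizes computed this way count file lengths rather than allocated disk blocks, so they may differ slightly from the values that cdp-doctor reports.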

Use Cases

- Performing pre-upgrade or health checks on cluster nodes.
- Investigating performance degradation or disk alerts.
- Validating system readiness during deployment or service restarts.

Sample Output

Running the cdp-doctor system metrics command displays output similar to the following:

Disk - Partitions:
+----------------+---------------+--------+---------+---------+----------+---------+----------+---------+
|     Device     |  Mountpoint   | Fstype | Maxfile | Maxpath |  Total   |  Used   |   Free   | Percent |
+----------------+---------------+--------+---------+---------+----------+---------+----------+---------+
| /dev/nvme0n1p3 |       /       |  xfs   |   255   |  4096   | 299.8 GB | 90.9 GB | 208.9 GB |  30.3%  |
| /dev/nvme0n1p2 |   /boot/efi   |  vfat  |  1530   |  4096   | 199.8 MB | 5.8 MB  | 194.0 MB |  2.9%   |
|  /dev/nvme1n1  | /hadoopfs/fs1 |  ext4  |   255   |  4096   | 502.9 GB | 8.6 GB  | 494.3 GB |  1.7%   |
+----------------+---------------+--------+---------+---------+----------+---------+----------+---------+
Disk - Top largest folders in /var/log:
+--------------------------------+----------+
|              Path              |   Size   |
+--------------------------------+----------+
|      /var/log/solr-infra       |  2.0 GB  |
|        /var/log/ranger         | 833.5 MB |
|         /var/log/atlas         | 230.6 MB |
|  /var/log/cloudera-scm-server  | 220.4 MB |
|      /var/log/hadoop-hdfs      | 200.9 MB |
|         /var/log/salt          | 119.2 MB |
| /var/log/cloudera-scm-firehose | 104.1 MB |
|         /var/log/knox          | 103.5 MB |
|  /var/log/cdp_resources_check  | 94.5 MB  |
|  /var/log/cdp-request-signer   | 88.0 MB  |
+--------------------------------+----------+
Network - Connections:
+-------------+-----+
|   LISTEN    | 81  |
| ESTABLISHED | 700 |
|  TIMEWAIT   | 146 |
|  CLOSEWAIT  | 35  |
|   CLOSED    |  0  |
|   SYNSEND   |  0  |
| SYNRECEIVED |  0  |
|  FINWAIT1   |  0  |
|  FINWAIT2   |  0  |
|   LASTACK   |  0  |
+-------------+-----+
CPU - Times:
+--------+--------+
|  idle  | 80.7 % |
| system | 4.4 %  |
|  user  | 14.0 % |
|  nice  | 0.0 %  |
+--------+--------+
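
If the output needs to be processed by a script, the bordered tables can be read with a small parser. The following Python sketch assumes the layout shown in the sample above (a title line ending in a colon, followed by rows delimited by "|"); parse_metrics is a hypothetical helper, not part of cdp-doctor, and the embedded text is an excerpt of the sample.

```python
def parse_metrics(text):
    """Parse bordered-table output into {section title: [row cells, ...]}."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("+"):
            continue  # blank lines and table borders carry no data
        if line.startswith("|"):
            if current is not None:
                cells = [cell.strip() for cell in line.strip("|").split("|")]
                sections[current].append(cells)
        elif line.endswith(":"):
            current = line[:-1]  # section title without the trailing colon
            sections[current] = []
    return sections


SAMPLE = """\
Network - Connections:
+-------------+-----+
|   LISTEN    | 81  |
| ESTABLISHED | 700 |
+-------------+-----+
CPU - Times:
+--------+--------+
|  idle  | 80.7 % |
| system | 4.4 %  |
+--------+--------+
"""

parsed = parse_metrics(SAMPLE)
connections = {state: int(count) for state, count in parsed["Network - Connections"]}
cpu = {mode: float(value.rstrip(" %")) for mode, value in parsed["CPU - Times"]}
print(connections)  # {'LISTEN': 81, 'ESTABLISHED': 700}
print(cpu)          # {'idle': 80.7, 'system': 4.4}
```

For the two Disk sections, the first row returned is the column header row (for example Device, Mountpoint, Fstype), so a caller would typically treat it as field names for the rows that follow.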

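When reviewing the output, note the following:

- Disk usage over 80% may trigger warnings and require cleanup.
- Large /var/log folders can indicate noisy services or misconfigured log rotation.
- A high CLOSEWAIT count (TCP CLOSE_WAIT state) usually means an application is not closing its sockets after the remote side disconnects. A high TIMEWAIT count (TCP TIME_WAIT state) points to a large number of short-lived connections.
- CPU idle time below 20% may indicate high load or resource pressure.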
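
The disk and CPU thresholds above can be applied automatically once the values have been extracted. The following Python sketch is one possible way to do so, not a feature of cdp-doctor. The function name check_thresholds and the socket limit of 100 are hypothetical; this document gives no numeric limit for connection states, so those are supplied by the caller.

```python
DISK_PERCENT_WARN = 80.0  # disk usage threshold stated in this document
CPU_IDLE_WARN = 20.0      # CPU idle threshold stated in this document


def check_thresholds(disk_percent, cpu_idle, connections=None, socket_limits=None):
    """Return a list of findings; an empty list means nothing crossed a threshold.

    disk_percent:  {mountpoint: percent used}
    cpu_idle:      idle percentage from the CPU - Times section
    connections:   {state: count} from the Network - Connections section
    socket_limits: {state: max count}; caller-chosen, hypothetical values
    """
    findings = []
    for mount, pct in sorted(disk_percent.items()):
        if pct > DISK_PERCENT_WARN:
            findings.append(f"disk: {mount} is {pct:.1f}% full")
    if cpu_idle < CPU_IDLE_WARN:
        findings.append(f"cpu: only {cpu_idle:.1f}% idle")
    for state, limit in sorted((socket_limits or {}).items()):
        count = (connections or {}).get(state, 0)
        if count > limit:
            findings.append(f"network: {count} sockets in {state}")
    return findings


# Values taken from the sample output: no threshold is crossed.
healthy = check_thresholds(
    {"/": 30.3, "/boot/efi": 2.9, "/hadoopfs/fs1": 1.7},
    cpu_idle=80.7,
    connections={"CLOSEWAIT": 35, "TIMEWAIT": 146},
    socket_limits={"CLOSEWAIT": 100},  # hypothetical limit
)
print(healthy)  # []
```

With the sample values the list is empty. A node reporting 91.2% usage on / and 12.0% CPU idle would produce one disk finding and one CPU finding.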