cdp-doctor system connections

Scope

The cdp-doctor system connections command provides a summary of active network socket states on a CDP node.

It helps identify the current network connection load and communication health between services, nodes, and external systems (such as Cloudera Manager, Data Lake, or Data Hub).

This command produces a netstat-style summary: it counts the network sockets that are open, listening, or waiting to close, giving a quick view of node-level connectivity status.

The cdp-doctor system connections command gathers and summarizes TCP socket states for all running processes on the node, including:

System daemons (sshd, nginx, fluentd, and so on)
Cloudera Manager agent and server connections
Service communication ports (for example, Hive ↔ Ranger ↔ HDFS)
Salt minion and master communication
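
To illustrate what a netstat-style summary of socket states involves, the following Python sketch counts TCP states from text in the Linux /proc/net/tcp format. This is not the cdp-doctor implementation, which is not described here; the function name count_tcp_states is hypothetical, and the sketch assumes a Linux node where the fourth whitespace-separated field of each row is the hexadecimal state code.

```python
from collections import Counter

# Linux kernel TCP state codes as they appear in the "st" column
# of /proc/net/tcp and /proc/net/tcp6.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_tcp_states(proc_net_tcp_text):
    """Count sockets per TCP state (hypothetical helper, not part of cdp-doctor)."""
    counts = Counter()
    # The first line is the column header, so skip it.
    for line in proc_net_tcp_text.splitlines()[1:]:
        fields = line.split()
        if len(fields) < 4:
            continue  # ignore blank or truncated rows
        counts[TCP_STATES.get(fields[3].upper(), "UNKNOWN")] += 1
    return counts
```

On a node, the function could be applied to the contents of /proc/net/tcp and /proc/net/tcp6 and the two results added together to approximate the totals that cdp-doctor reports.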

Sample Output

Running the cdp-doctor system connections command displays the following output:

Connections - States:
+-------------+-----+
| ESTABLISHED | 722 |
|  TIME_WAIT  | 137 |
|   LISTEN    | 81  |
| CLOSE_WAIT  | 36  |
|  FIN_WAIT2  |  2  |
+-------------+-----+
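
For scripted monitoring, the table above can be turned into a dictionary. The following Python sketch assumes the output keeps the exact layout shown in this sample (one "| STATE | COUNT |" row per state); the function name parse_connection_states is hypothetical.

```python
import re

# Matches a data row such as "| ESTABLISHED | 722 |"; border and title lines do not match.
ROW = re.compile(r"^\|\s*([A-Z_0-9]+)\s*\|\s*(\d+)\s*\|$")


def parse_connection_states(output):
    """Return {state: count} from the command's table output (hypothetical helper)."""
    states = {}
    for line in output.splitlines():
        match = ROW.match(line.strip())
        if match:
            states[match.group(1)] = int(match.group(2))
    return states
```

Applied to the sample output, this yields a mapping such as ESTABLISHED to 722 and CLOSE_WAIT to 36, which can then be compared against thresholds or logged over time.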


State	Connections	Typical Interpretation	Description
ESTABLISHED	722	Normal for communication between cluster components and agents.	Active two-way connections between processes.
LISTEN	81	Indicates servers or daemons ready to accept connections (e.g., cm-server, nginx, sshd).	Services actively listening for new incoming connections.
TIME_WAIT	137	Normal, but large values may indicate high churn or frequent restarts.	Connections recently closed, waiting before reuse.
CLOSE_WAIT	36	A few are normal; large counts can indicate processes not properly releasing sockets.	The remote side has closed the connection; the local process has not yet closed its socket.
FIN_WAIT2	2	Usually transient; persistent values may indicate ungraceful termination or stuck services.	The local side has closed the connection and is waiting for the remote side to finish closing.

A high (>100) CLOSE_WAIT value may indicate a service not closing sockets properly. Check for stuck or zombie processes.
High (>1000) TIME_WAIT values are common during heavy data movement or restarts. Persistent high counts may affect performance.
A low LISTEN count (0, or fewer than expected) may indicate that service daemons have not started or are misconfigured.
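
The rules of thumb above can be expressed as a small check. The following Python sketch assumes the counts are available as a dictionary keyed by state name; the function name check_connection_states and its parameters are hypothetical, and the default thresholds simply mirror the guidance above rather than any limits enforced by cdp-doctor.

```python
def check_connection_states(states, close_wait_max=100, time_wait_max=1000,
                            listen_min=1):
    """Return a list of warnings based on the rule-of-thumb thresholds (hypothetical helper)."""
    findings = []
    close_wait = states.get("CLOSE_WAIT", 0)
    time_wait = states.get("TIME_WAIT", 0)
    listen = states.get("LISTEN", 0)

    if close_wait > close_wait_max:
        findings.append(
            f"CLOSE_WAIT is {close_wait}: a service may not be closing sockets properly")
    if time_wait > time_wait_max:
        findings.append(
            f"TIME_WAIT is {time_wait}: high connection churn or recent restarts")
    if listen < listen_min:
        findings.append(
            f"LISTEN is {listen}: service daemons may not have started or are misconfigured")
    return findings
```

The listen_min parameter can be raised to the number of listening sockets normally expected on a given node, so that a partial loss of services is flagged as well as a complete one.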