SRM Service Metrics

Metric Name | Description | Unit | CDH Version
alerts_rate | The number of alerts. | events per second | [CDH 5.0.0..CDH 6.0.0), [CDH 6.0.0..CDH 7.0.0), [CDH 7.0.0..CDH 8.0.0), [CM -1.0.0..CM -1.0.0]
cgroup_cpu_system_rate | CPU usage of the role's cgroup | seconds per second | [CDH 5.0.0..CDH 6.0.0), [CDH 6.0.0..CDH 7.0.0), [CDH 7.0.0..CDH 8.0.0), [CM -1.0.0..CM -1.0.0]
cgroup_cpu_user_rate | User Space CPU usage of the role's cgroup | seconds per second | [CDH 5.0.0..CDH 6.0.0), [CDH 6.0.0..CDH 7.0.0), [CDH 7.0.0..CDH 8.0.0), [CM -1.0.0..CM -1.0.0]
cgroup_mem_page_cache | Page cache usage of the role's cgroup | bytes | [CDH 5.0.0..CDH 6.0.0), [CDH 6.0.0..CDH 7.0.0), [CDH 7.0.0..CDH 8.0.0), [CM -1.0.0..CM -1.0.0]
cgroup_mem_rss | Resident memory of the role's cgroup | bytes | [CDH 5.0.0..CDH 6.0.0), [CDH 6.0.0..CDH 7.0.0), [CDH 7.0.0..CDH 8.0.0), [CM -1.0.0..CM -1.0.0]
cgroup_mem_swap | Swap usage of the role's cgroup | bytes | [CDH 5.0.0..CDH 6.0.0), [CDH 6.0.0..CDH 7.0.0), [CDH 7.0.0..CDH 8.0.0), [CM -1.0.0..CM -1.0.0]
cgroup_read_bytes_rate | Bytes read from all disks by the role's cgroup | bytes per second | [CDH 5.0.0..CDH 6.0.0), [CDH 6.0.0..CDH 7.0.0), [CDH 7.0.0..CDH 8.0.0), [CM -1.0.0..CM -1.0.0]
cgroup_read_ios_rate | Number of read I/O operations from all disks by the role's cgroup | ios per second | [CDH 5.0.0..CDH 6.0.0), [CDH 6.0.0..CDH 7.0.0), [CDH 7.0.0..CDH 8.0.0), [CM -1.0.0..CM -1.0.0]
cgroup_write_bytes_rate | Bytes written to all disks by the role's cgroup | bytes per second | [CDH 5.0.0..CDH 6.0.0), [CDH 6.0.0..CDH 7.0.0), [CDH 7.0.0..CDH 8.0.0), [CM -1.0.0..CM -1.0.0]
cgroup_write_ios_rate | Number of write I/O operations to all disks by the role's cgroup | ios per second | [CDH 5.0.0..CDH 6.0.0), [CDH 6.0.0..CDH 7.0.0), [CDH 7.0.0..CDH 8.0.0), [CM -1.0.0..CM -1.0.0]
cpu_system_rate | Total System CPU | seconds per second | [CDH 5.0.0..CDH 6.0.0), [CDH 6.0.0..CDH 7.0.0), [CDH 7.0.0..CDH 8.0.0), [CM -1.0.0..CM -1.0.0]
cpu_user_rate | Total CPU user time | seconds per second | [CDH 5.0.0..CDH 6.0.0), [CDH 6.0.0..CDH 7.0.0), [CDH 7.0.0..CDH 8.0.0), [CM -1.0.0..CM -1.0.0]
events_critical_rate | The number of critical events. | events per second | [CDH 5.0.0..CDH 6.0.0), [CDH 6.0.0..CDH 7.0.0), [CDH 7.0.0..CDH 8.0.0), [CM -1.0.0..CM -1.0.0]
events_important_rate | The number of important events. | events per second | [CDH 5.0.0..CDH 6.0.0), [CDH 6.0.0..CDH 7.0.0), [CDH 7.0.0..CDH 8.0.0), [CM -1.0.0..CM -1.0.0]
events_informational_rate | The number of informational events. | events per second | [CDH 5.0.0..CDH 6.0.0), [CDH 6.0.0..CDH 7.0.0), [CDH 7.0.0..CDH 8.0.0), [CM -1.0.0..CM -1.0.0]
fd_max | Maximum number of file descriptors | file descriptors | [CDH 5.0.0..CDH 6.0.0), [CDH 6.0.0..CDH 7.0.0), [CDH 7.0.0..CDH 8.0.0), [CM -1.0.0..CM -1.0.0]
fd_open | Open file descriptors. | file descriptors | [CDH 5.0.0..CDH 6.0.0), [CDH 6.0.0..CDH 7.0.0), [CDH 7.0.0..CDH 8.0.0), [CM -1.0.0..CM -1.0.0]
health_bad_rate | Percentage of Time with Bad Health | seconds per second | [CDH 5.0.0..CDH 6.0.0), [CDH 6.0.0..CDH 7.0.0), [CDH 7.0.0..CDH 8.0.0), [CM -1.0.0..CM -1.0.0]
health_concerning_rate | Percentage of Time with Concerning Health | seconds per second | [CDH 5.0.0..CDH 6.0.0), [CDH 6.0.0..CDH 7.0.0), [CDH 7.0.0..CDH 8.0.0), [CM -1.0.0..CM -1.0.0]
health_disabled_rate | Percentage of Time with Disabled Health | seconds per second | [CDH 5.0.0..CDH 6.0.0), [CDH 6.0.0..CDH 7.0.0), [CDH 7.0.0..CDH 8.0.0), [CM -1.0.0..CM -1.0.0]
health_good_rate | Percentage of Time with Good Health | seconds per second | [CDH 5.0.0..CDH 6.0.0), [CDH 6.0.0..CDH 7.0.0), [CDH 7.0.0..CDH 8.0.0), [CM -1.0.0..CM -1.0.0]
health_unknown_rate | Percentage of Time with Unknown Health | seconds per second | [CDH 5.0.0..CDH 6.0.0), [CDH 6.0.0..CDH 7.0.0), [CDH 7.0.0..CDH 8.0.0), [CM -1.0.0..CM -1.0.0]
mem_rss | Resident memory used | bytes | [CDH 5.0.0..CDH 6.0.0), [CDH 6.0.0..CDH 7.0.0), [CDH 7.0.0..CDH 8.0.0), [CM -1.0.0..CM -1.0.0]
mem_swap | Amount of swap memory used by this role's process. | bytes | [CDH 5.0.0..CDH 6.0.0), [CDH 6.0.0..CDH 7.0.0), [CDH 7.0.0..CDH 8.0.0), [CM -1.0.0..CM -1.0.0]
mem_virtual | Virtual memory used | bytes | [CDH 5.0.0..CDH 6.0.0), [CDH 6.0.0..CDH 7.0.0), [CDH 7.0.0..CDH 8.0.0), [CM -1.0.0..CM -1.0.0]
oom_exits_rate | The number of times the role's backing process was killed due to an OutOfMemory error. This counter is only incremented if the Cloudera Manager "Kill When Out of Memory" option is enabled. | exits per second | [CDH 5.0.0..CDH 6.0.0), [CDH 6.0.0..CDH 7.0.0), [CDH 7.0.0..CDH 8.0.0), [CM -1.0.0..CM -1.0.0]
read_bytes_rate | The number of bytes read from the device | bytes per second | [CDH 5.0.0..CDH 6.0.0), [CDH 6.0.0..CDH 7.0.0), [CDH 7.0.0..CDH 8.0.0), [CM -1.0.0..CM -1.0.0]
unexpected_exits_rate | The number of times the role's backing process exited unexpectedly. | exits per second | [CDH 5.0.0..CDH 6.0.0), [CDH 6.0.0..CDH 7.0.0), [CDH 7.0.0..CDH 8.0.0), [CM -1.0.0..CM -1.0.0]
uptime | For a host, the amount of time since the host was booted. For a role, the uptime of the backing process. | seconds | [CDH 5.0.0..CDH 6.0.0), [CDH 6.0.0..CDH 7.0.0), [CDH 7.0.0..CDH 8.0.0), [CM -1.0.0..CM -1.0.0]
write_bytes_rate | The number of bytes written to the device | bytes per second | [CDH 5.0.0..CDH 6.0.0), [CDH 6.0.0..CDH 7.0.0), [CDH 7.0.0..CDH 8.0.0), [CM -1.0.0..CM -1.0.0]
streams_replication_manager_metrics_processor_status_code | The status code of the SRM service metrics processor. 0: HEALTHY, 1: INITIALIZING_METRICS_PROCESSOR, 2: RESTARTING_METRICS_PROCESSOR | message.units.status_code | [CDH 5.0.0..CDH 8.0.0)
streams_replication_manager_service_remote_service_discovery_endpoint_group_aggregated_status_code | The aggregated status codes of the remote SRM Service discovery endpoint groups. These endpoint groups represent the discovered remote SRM Service clusters. For an endpoint group to be available, it needs to have at least 1 active member, and all members should advertise the same protocol. 0: all endpoint groups are available, non-zero: one or more endpoint groups are not available | message.units.connection_status_code | [CDH 5.0.0..CDH 8.0.0)
streams_replication_manager_service_remote_service_discovery_endpoint_group_health_check_aggregated_status_code | The aggregated status codes of the remote SRM Service discovery endpoint groups validated with health checks. These endpoint groups represent the discovered remote SRM Service clusters, with their members being health-checked. For an endpoint group to be available, it needs to have at least 1 healthy member. 0: all endpoint groups are available, non-zero: one or more endpoint groups are not available | message.units.connection_status_code | [CDH 5.0.0..CDH 8.0.0)
streams_replication_manager_service_remote_service_discovery_topic_consumer_aggregated_status_code | The aggregated status codes of the remote SRM Service discovery topic consumers. These consumers connect to the remote target Kafka clusters listed in the 'Streams Replication Manager Service Remote Target Clusters' configuration of SRM Service. 0: all consumers are connected to their corresponding remote target Kafka cluster, non-zero: one or more consumers are not connected | message.units.connection_status_code | [CDH 5.0.0..CDH 8.0.0)
streams_replication_manager_service_target_metrics_processor_aggregated_status_code | The aggregated status code of the SRM service metrics processors. These metrics processors connect to the target Kafka clusters listed in the 'Streams Replication Manager Service Target Clusters' configuration of SRM Service. 0: all metrics processors are connected to a target Kafka cluster and working as expected, non-zero: one or more metrics processors are either restarting or initializing | message.units.status_code | [CDH 5.0.0..CDH 8.0.0)
streams_replication_manager_service_target_metrics_streams_application_kafka_connection_aggregated_status_code | The aggregated status code of the SRM service metrics Streams Applications. These Streams Applications connect to the target Kafka clusters listed in the 'Streams Replication Manager Service Target Clusters' configuration of SRM Service. 0: all metrics Streams Applications are connected to a target Kafka cluster, non-zero: one or more metrics Streams Applications are not connected | message.units.connection_status_code | [CDH 5.0.0..CDH 8.0.0)
streams_replication_manager_service_target_service_discovery_heartbeat_producer_aggregated_status_code | The aggregated status codes of the service discovery heartbeat producers. These producers connect to the target Kafka clusters listed in the 'Streams Replication Manager Service Target Clusters' configuration of SRM Service. 0: all producers are connected to their corresponding target Kafka cluster, non-zero: one or more producers are not connected | message.units.connection_status_code | [CDH 5.0.0..CDH 8.0.0)
streams_replication_manager_streams_kafka_connection_status_code | The status code of the Streams App Kafka Connection. 0: CONNECTED, 1: DISCONNECTED | message.units.connection_status_code | [CDH 5.0.0..CDH 8.0.0)
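
The status-code metrics listed above report small integers rather than readable states. As an illustration only, the following Python sketch shows one way a monitoring script might turn a reading into a label. It assumes the metric value has already been retrieved by some other means and may arrive as a float. The function name describe_status and the labels ALL_OK and DEGRADED are hypothetical and are not part of SRM or Cloudera Manager; the remaining labels are copied from the metric descriptions.

```python
# Code-to-label mappings copied from the metric descriptions in this table.
PROCESSOR_STATUS = {
    0: "HEALTHY",
    1: "INITIALIZING_METRICS_PROCESSOR",
    2: "RESTARTING_METRICS_PROCESSOR",
}
CONNECTION_STATUS = {0: "CONNECTED", 1: "DISCONNECTED"}

PROCESSOR_METRIC = "streams_replication_manager_metrics_processor_status_code"
CONNECTION_METRIC = "streams_replication_manager_streams_kafka_connection_status_code"


def describe_status(metric_name, value):
    """Return a readable label for a status-code metric reading (hypothetical helper)."""
    code = int(value)  # assumption: readings may arrive as floats such as 0.0
    if metric_name == PROCESSOR_METRIC:
        return PROCESSOR_STATUS.get(code, f"UNKNOWN({code})")
    if metric_name == CONNECTION_METRIC:
        return CONNECTION_STATUS.get(code, f"UNKNOWN({code})")
    if metric_name.endswith("_aggregated_status_code"):
        # The table only distinguishes zero from non-zero for aggregated codes.
        # ALL_OK and DEGRADED are illustrative labels, not SRM terminology.
        return "ALL_OK" if code == 0 else "DEGRADED"
    raise ValueError(f"not a status-code metric: {metric_name}")


print(describe_status(PROCESSOR_METRIC, 2))  # RESTARTING_METRICS_PROCESSOR
```

Aggregated metrics are treated as a simple zero versus non-zero check because the descriptions do not define what individual non-zero values mean.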