SRM Service Metrics

Reference information for SRM Service metrics.

In addition to the base metrics listed below, many aggregate metrics are available. If an entity type has parents defined, you can form every possible aggregate metric name with the pattern base_metric_across_parents, where parents is the plural form of the parent type.

To form a metric for aggregate totals, add the prefix total_ to the aggregate metric name.

Use the type-ahead feature in the Cloudera Manager chart browser to find the exact aggregate metric name, in case the plural form does not end in "s".

For example, the following metric names may be valid for SRM Service:

alerts_rate_across_clusters
total_alerts_rate_across_clusters
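
The naming pattern described above can be sketched in Python. This is only an illustration, not a Cloudera Manager API: the helper name aggregate_metric_names is hypothetical, and it assumes every parent type pluralizes by appending "s". Confirm the real names with the type-ahead feature in the chart browser.

```python
def aggregate_metric_names(base_metric, parents):
    """Build candidate aggregate metric names for a base metric.

    Assumption: each parent type pluralizes by appending "s".
    Irregular plurals must be looked up in the chart browser instead.
    """
    names = []
    for parent in parents:
        across = f"{base_metric}_across_{parent}s"
        names.append(across)              # aggregate across the parent type
        names.append(f"total_{across}")   # aggregate total variant
    return names


candidates = aggregate_metric_names("alerts_rate", ["cluster", "rack"])
print(candidates)
```

The returned names are candidates only; a name produced this way may not exist for a given entity type.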

Some metrics, such as alerts_rate, apply to nearly every metric context. Others apply only to a specific service or role.

alerts_rate

Description: The number of alerts.
Unit: events per second
Parents: cluster, rack, streams_replication_manager
CDH Version: [CDH 5.0.0..CDH 6.0.0), [CDH 6.0.0..CDH 7.0.0), [CDH 7.0.0..CDH 8.0.0), [CM -1.0.0..CM -1.0.0]

cgroup_cpu_system_rate

Description: CPU usage of the role's cgroup
Unit: seconds per second
Parents: cluster, rack, streams_replication_manager
CDH Version: [CDH 5.0.0..CDH 6.0.0), [CDH 6.0.0..CDH 7.0.0), [CDH 7.0.0..CDH 8.0.0), [CM -1.0.0..CM -1.0.0]

cgroup_cpu_user_rate

Description: User Space CPU usage of the role's cgroup
Unit: seconds per second
Parents: cluster, rack, streams_replication_manager
CDH Version: [CDH 5.0.0..CDH 6.0.0), [CDH 6.0.0..CDH 7.0.0), [CDH 7.0.0..CDH 8.0.0), [CM -1.0.0..CM -1.0.0]

cgroup_mem_page_cache

Description: Page cache usage of the role's cgroup
Unit: bytes
Parents: cluster, rack, streams_replication_manager
CDH Version: [CDH 5.0.0..CDH 6.0.0), [CDH 6.0.0..CDH 7.0.0), [CDH 7.0.0..CDH 8.0.0), [CM -1.0.0..CM -1.0.0]

cgroup_mem_rss

Description: Resident memory of the role's cgroup
Unit: bytes
Parents: cluster, rack, streams_replication_manager
CDH Version: [CDH 5.0.0..CDH 6.0.0), [CDH 6.0.0..CDH 7.0.0), [CDH 7.0.0..CDH 8.0.0), [CM -1.0.0..CM -1.0.0]

cgroup_mem_swap

Description: Swap usage of the role's cgroup
Unit: bytes
Parents: cluster, rack, streams_replication_manager
CDH Version: [CDH 5.0.0..CDH 6.0.0), [CDH 6.0.0..CDH 7.0.0), [CDH 7.0.0..CDH 8.0.0), [CM -1.0.0..CM -1.0.0]

cgroup_read_bytes_rate

Description: Bytes read from all disks by the role's cgroup
Unit: bytes per second
Parents: cluster, rack, streams_replication_manager
CDH Version: [CDH 5.0.0..CDH 6.0.0), [CDH 6.0.0..CDH 7.0.0), [CDH 7.0.0..CDH 8.0.0), [CM -1.0.0..CM -1.0.0]

cgroup_read_ios_rate

Description: Number of read I/O operations from all disks by the role's cgroup
Unit: ios per second
Parents: cluster, rack, streams_replication_manager
CDH Version: [CDH 5.0.0..CDH 6.0.0), [CDH 6.0.0..CDH 7.0.0), [CDH 7.0.0..CDH 8.0.0), [CM -1.0.0..CM -1.0.0]

cgroup_write_bytes_rate

Description: Bytes written to all disks by the role's cgroup
Unit: bytes per second
Parents: cluster, rack, streams_replication_manager
CDH Version: [CDH 5.0.0..CDH 6.0.0), [CDH 6.0.0..CDH 7.0.0), [CDH 7.0.0..CDH 8.0.0), [CM -1.0.0..CM -1.0.0]

cgroup_write_ios_rate

Description: Number of write I/O operations to all disks by the role's cgroup
Unit: ios per second
Parents: cluster, rack, streams_replication_manager
CDH Version: [CDH 5.0.0..CDH 6.0.0), [CDH 6.0.0..CDH 7.0.0), [CDH 7.0.0..CDH 8.0.0), [CM -1.0.0..CM -1.0.0]

cpu_system_rate

Description: Total System CPU
Unit: seconds per second
Parents: cluster, rack, streams_replication_manager
CDH Version: [CDH 5.0.0..CDH 6.0.0), [CDH 6.0.0..CDH 7.0.0), [CDH 7.0.0..CDH 8.0.0), [CM -1.0.0..CM -1.0.0]

cpu_user_rate

Description: Total CPU user time
Unit: seconds per second
Parents: cluster, rack, streams_replication_manager
CDH Version: [CDH 5.0.0..CDH 6.0.0), [CDH 6.0.0..CDH 7.0.0), [CDH 7.0.0..CDH 8.0.0), [CM -1.0.0..CM -1.0.0]

events_critical_rate

Description: The number of critical events.
Unit: events per second
Parents: cluster, rack, streams_replication_manager
CDH Version: [CDH 5.0.0..CDH 6.0.0), [CDH 6.0.0..CDH 7.0.0), [CDH 7.0.0..CDH 8.0.0), [CM -1.0.0..CM -1.0.0]

events_important_rate

Description: The number of important events.
Unit: events per second
Parents: cluster, rack, streams_replication_manager
CDH Version: [CDH 5.0.0..CDH 6.0.0), [CDH 6.0.0..CDH 7.0.0), [CDH 7.0.0..CDH 8.0.0), [CM -1.0.0..CM -1.0.0]

events_informational_rate

Description: The number of informational events.
Unit: events per second
Parents: cluster, rack, streams_replication_manager
CDH Version: [CDH 5.0.0..CDH 6.0.0), [CDH 6.0.0..CDH 7.0.0), [CDH 7.0.0..CDH 8.0.0), [CM -1.0.0..CM -1.0.0]

fd_max

Description: Maximum number of file descriptors
Unit: file descriptors
Parents: cluster, rack, streams_replication_manager
CDH Version: [CDH 5.0.0..CDH 6.0.0), [CDH 6.0.0..CDH 7.0.0), [CDH 7.0.0..CDH 8.0.0), [CM -1.0.0..CM -1.0.0]

fd_open

Description: Open file descriptors.
Unit: file descriptors
Parents: cluster, rack, streams_replication_manager
CDH Version: [CDH 5.0.0..CDH 6.0.0), [CDH 6.0.0..CDH 7.0.0), [CDH 7.0.0..CDH 8.0.0), [CM -1.0.0..CM -1.0.0]

health_bad_rate

Description: Percentage of Time with Bad Health
Unit: seconds per second
Parents: cluster, rack, streams_replication_manager
CDH Version: [CDH 5.0.0..CDH 6.0.0), [CDH 6.0.0..CDH 7.0.0), [CDH 7.0.0..CDH 8.0.0), [CM -1.0.0..CM -1.0.0]

health_concerning_rate

Description: Percentage of Time with Concerning Health
Unit: seconds per second
Parents: cluster, rack, streams_replication_manager
CDH Version: [CDH 5.0.0..CDH 6.0.0), [CDH 6.0.0..CDH 7.0.0), [CDH 7.0.0..CDH 8.0.0), [CM -1.0.0..CM -1.0.0]

health_disabled_rate

Description: Percentage of Time with Disabled Health
Unit: seconds per second
Parents: cluster, rack, streams_replication_manager
CDH Version: [CDH 5.0.0..CDH 6.0.0), [CDH 6.0.0..CDH 7.0.0), [CDH 7.0.0..CDH 8.0.0), [CM -1.0.0..CM -1.0.0]

health_good_rate

Description: Percentage of Time with Good Health
Unit: seconds per second
Parents: cluster, rack, streams_replication_manager
CDH Version: [CDH 5.0.0..CDH 6.0.0), [CDH 6.0.0..CDH 7.0.0), [CDH 7.0.0..CDH 8.0.0), [CM -1.0.0..CM -1.0.0]

health_unknown_rate

Description: Percentage of Time with Unknown Health
Unit: seconds per second
Parents: cluster, rack, streams_replication_manager
CDH Version: [CDH 5.0.0..CDH 6.0.0), [CDH 6.0.0..CDH 7.0.0), [CDH 7.0.0..CDH 8.0.0), [CM -1.0.0..CM -1.0.0]

mem_rss

Description: Resident memory used
Unit: bytes
Parents: cluster, rack, streams_replication_manager
CDH Version: [CDH 5.0.0..CDH 6.0.0), [CDH 6.0.0..CDH 7.0.0), [CDH 7.0.0..CDH 8.0.0), [CM -1.0.0..CM -1.0.0]

mem_swap

Description: Amount of swap memory used by this role's process.
Unit: bytes
Parents: cluster, rack, streams_replication_manager
CDH Version: [CDH 5.0.0..CDH 6.0.0), [CDH 6.0.0..CDH 7.0.0), [CDH 7.0.0..CDH 8.0.0), [CM -1.0.0..CM -1.0.0]

mem_virtual

Description: Virtual memory used
Unit: bytes
Parents: cluster, rack, streams_replication_manager
CDH Version: [CDH 5.0.0..CDH 6.0.0), [CDH 6.0.0..CDH 7.0.0), [CDH 7.0.0..CDH 8.0.0), [CM -1.0.0..CM -1.0.0]

oom_exits_rate

Description: The number of times the role's backing process was killed due to an OutOfMemory error. This counter is only incremented if the Cloudera Manager "Kill When Out of Memory" option is enabled.
Unit: exits per second
Parents: cluster, rack, streams_replication_manager
CDH Version: [CDH 5.0.0..CDH 6.0.0), [CDH 6.0.0..CDH 7.0.0), [CDH 7.0.0..CDH 8.0.0), [CM -1.0.0..CM -1.0.0]

read_bytes_rate

Description: The number of bytes read from the device
Unit: bytes per second
Parents: cluster, rack, streams_replication_manager
CDH Version: [CDH 5.0.0..CDH 6.0.0), [CDH 6.0.0..CDH 7.0.0), [CDH 7.0.0..CDH 8.0.0), [CM -1.0.0..CM -1.0.0]

unexpected_exits_rate

Description: The number of times the role's backing process exited unexpectedly.
Unit: exits per second
Parents: cluster, rack, streams_replication_manager
CDH Version: [CDH 5.0.0..CDH 6.0.0), [CDH 6.0.0..CDH 7.0.0), [CDH 7.0.0..CDH 8.0.0), [CM -1.0.0..CM -1.0.0]

uptime

Description: For a host, the amount of time since the host was booted. For a role, the uptime of the backing process.
Unit: seconds
Parents: cluster, rack, streams_replication_manager
CDH Version: [CDH 5.0.0..CDH 6.0.0), [CDH 6.0.0..CDH 7.0.0), [CDH 7.0.0..CDH 8.0.0), [CM -1.0.0..CM -1.0.0]

write_bytes_rate

Description: The number of bytes written to the device
Unit: bytes per second
Parents: cluster, rack, streams_replication_manager
CDH Version: [CDH 5.0.0..CDH 6.0.0), [CDH 6.0.0..CDH 7.0.0), [CDH 7.0.0..CDH 8.0.0), [CM -1.0.0..CM -1.0.0]

streams_replication_manager_metrics_processor_status_code

Description: The status code of the SRM service metrics processor. 0: HEALTHY, 1: INITIALIZING_METRICS_PROCESSOR, 2: RESTARTING_METRICS_PROCESSOR
Unit: status code
Parents: cluster, rack, streams_replication_manager
CDH Version: [CDH 5.0.0..CDH 8.0.0)
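
As a minimal sketch, the status codes listed for streams_replication_manager_metrics_processor_status_code could be decoded as follows. The enum and function names are hypothetical, and the code assumes the sampled value arrives as a number (possibly a float). Only the three code values come from the description above.

```python
from enum import IntEnum


class MetricsProcessorStatus(IntEnum):
    # Code values taken from the metric description.
    HEALTHY = 0
    INITIALIZING_METRICS_PROCESSOR = 1
    RESTARTING_METRICS_PROCESSOR = 2


def describe_processor_status(value):
    """Return the status name for a sampled value, or "UNKNOWN"."""
    try:
        return MetricsProcessorStatus(int(value)).name
    except ValueError:
        # Raised for codes that are not documented for this metric.
        return "UNKNOWN"


print(describe_processor_status(2.0))
```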

streams_replication_manager_service_remote_service_discovery_endpoint_group_aggregated_status_code

Description: The aggregated status codes of the remote SRM Service discovery endpoint groups. These endpoint groups represent the discovered remote SRM Service clusters. For an endpoint group to be available, it needs to have at least 1 active member, and all members should advertise the same protocol. 0: all endpoint groups are available, non-zero: one or more endpoint groups are not available
Unit: connection status code
Parents: cluster, rack, streams_replication_manager
CDH Version: [CDH 5.0.0..CDH 8.0.0)

streams_replication_manager_service_remote_service_discovery_endpoint_group_health_check_aggregated_status_code

Description: The aggregated status codes of the remote SRM Service discovery endpoint groups validated with health checks. These endpoint groups represent the discovered remote SRM Service clusters, with their members being health-checked. For an endpoint group to be available, it needs to have at least 1 healthy member. 0: all endpoint groups are available, non-zero: one or more endpoint groups are not available
Unit: connection status code
Parents: cluster, rack, streams_replication_manager
CDH Version: [CDH 5.0.0..CDH 8.0.0)

streams_replication_manager_service_remote_service_discovery_topic_consumer_aggregated_status_code

Description: The aggregated status codes of the remote SRM Service discovery topic consumers. These consumers connect to the remote target Kafka clusters listed in the 'Streams Replication Manager Service Remote Target Clusters' configuration of SRM Service. 0: all consumers are connected to their corresponding remote target Kafka cluster, non-zero: one or more consumers are not connected
Unit: connection status code
Parents: cluster, rack, streams_replication_manager
CDH Version: [CDH 5.0.0..CDH 8.0.0)

streams_replication_manager_service_target_metrics_processor_aggregated_status_code

Description: The aggregated status code of the SRM service metrics processors. These metrics processors connect to the target Kafka clusters listed in the 'Streams Replication Manager Service Target Clusters' configuration of SRM Service. 0: all metrics processors are connected to a target Kafka cluster and working as expected, non-zero: one or more metrics processors are either restarting or initializing
Unit: status code
Parents: cluster, rack, streams_replication_manager
CDH Version: [CDH 5.0.0..CDH 8.0.0)

streams_replication_manager_service_target_metrics_streams_application_kafka_connection_aggregated_status_code

Description: The aggregated status code of the SRM service metrics Streams Applications. These Streams Applications connect to the target Kafka clusters listed in the 'Streams Replication Manager Service Target Clusters' configuration of SRM Service. 0: all metrics Streams Applications are connected to a target Kafka cluster, non-zero: one or more metrics Streams Applications are not connected
Unit: connection status code
Parents: cluster, rack, streams_replication_manager
CDH Version: [CDH 5.0.0..CDH 8.0.0)

streams_replication_manager_service_target_service_discovery_heartbeat_producer_aggregated_status_code

Description: The aggregated status codes of the service discovery heartbeat producers. These producers connect to the target Kafka clusters listed in the 'Streams Replication Manager Service Target Clusters' configuration of SRM Service. 0: all producers are connected to their corresponding target Kafka cluster, non-zero: one or more producers are not connected
Unit: connection status code
Parents: cluster, rack, streams_replication_manager
CDH Version: [CDH 5.0.0..CDH 8.0.0)

streams_replication_manager_streams_kafka_connection_status_code

Description: The status code of the Streams App Kafka Connection. 0: CONNECTED, 1: DISCONNECTED
Unit: connection status code
Parents: cluster, rack, streams_replication_manager
CDH Version: [CDH 5.0.0..CDH 8.0.0)
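
The connection and aggregated status codes in this section can be interpreted with a small helper, sketched here under stated assumptions. The function names are hypothetical, and the sampled value is assumed to be numeric. The only facts taken from the descriptions are that 0 means CONNECTED and 1 means DISCONNECTED for streams_replication_manager_streams_kafka_connection_status_code, and that an aggregated status code of 0 means every underlying component is available or connected.

```python
# Code values taken from the Streams App Kafka Connection description.
CONNECTION_STATUS = {0: "CONNECTED", 1: "DISCONNECTED"}


def connection_status(value):
    """Map a connection status code to its name, or "UNKNOWN"."""
    return CONNECTION_STATUS.get(int(value), "UNKNOWN")


def aggregated_status_ok(value):
    """Return True when an aggregated status code reports no problems.

    Per the descriptions, any non-zero value means that one or more
    components are unavailable, disconnected, or not yet ready.
    """
    return int(value) == 0


print(connection_status(1), aggregated_status_ok(0))
```

The sketch treats every non-zero aggregated value the same way, because the descriptions do not define the meaning of individual non-zero values.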