Managing Alert Policies using Streams Messaging ManagerPDF version

Component Types and Metrics for Alert Policies

You create an alert policy for a component type. The component type drives the list of metrics to select for creating a threshold.

The following table lists the component types and metrics for an alert policy:
Table 1. Component Types and Metrics
Component Type Metric Description Suggested Alert
Topic UNDER REPLICATED PARTITIONS COUNT Total number of partitions that are under replicated for a topic. Value > 0.
BYTES IN PER SEC Bytes per second coming in to a topic. Two kinds of alert can be configured.
  • Alert-1: Value = 0, raises an alert when the topic becomes idle.
  • Alert-2: Value > max_bytes_in_expected, raises an alert when the topic input load is higher than usual.
BYTES OUT PER SEC Bytes per second going out from a topic. It does not count the internal replication traffic. Two kinds of alert can be configured.
  • Alert-1: Value = 0, raises an alert when the topic becomes idle.
  • Alert-2: Value > max_bytes_out_expected, raises an alert when the topic output load is higher than usual.
OUT OF SYNC REPLICA COUNT Total number of replicas that are not in sync with the leader for a topic. Value > 0, raises an alert if there are out of sync replicas for the topic.
TOPIC PARTITION CONSUMPTION PERCENTAGE

Percentage of bytes consumed per topic partition compared according to the configured parameter retention.bytes. If retention.bytes is not configured, any condition involving this metric would be false.

Value > max_expected_value, raises an alert if the topic partition reaches a certain consumption percentage.
TOPIC PARTITION BYTES IN PER SEC Bytes per second coming in to a topic partition. Two kinds of alert can be configured.
  • Alert-1: Value = 0, raises an alert when the topic partition becomes idle.
  • Alert-2: Value > max_bytes_in_expected, raises an alert when the topic partition input load is higher than usual.
TOPIC PARTITION BYTES OUT PER SEC Bytes per second coming out of a topic partition. Two kinds of alert can be configured.
  • Alert-1: Value = 0, raises an alert when the topic partition becomes idle.
  • Alert-2: Value > max_bytes_out_expected, raises an alert when the topic partition output load is higher than usual.
Producer IS PRODUCER ACTIVE Checks whether a producer is active. Value is False.
MILLISECONDS LAPSED SINCE PRODUCER WAS ACTIVE Milliseconds passed since the producer was last active. Value > max_producer_idle_time, raises an alert if the producer did not produce for max_producer_idle_time ms.
Cluster ACTIVE CONTROLLER COUNT Number of brokers in the cluster reporting as the active controller in the last interval. Value != 1.
ONLINE BROKER COUNT Number of brokers that are currently online. Depends on the application.
For example, you can raise an alert if the number of brokers falls below the min.insync.replicas configured for the producer.
  • Alert-1: Value < min.insync.replicas, raises an alert when producer could not send any messages.
  • Alert-2: Value = min.insync.replicas, raises an alert that denotes if any one of the remaining brokers goes down, then producer would not be able to send messages.
UNCLEAN LEADER ELECTION COUNT Number of unclean partition leader elections in the cluster reported in the last interval. Value > 0.
UNDER REPLICATED PARTITIONS COUNT Total number of topic partitions in the cluster that are under replicated. Value > 0.
LEADER ELECTION PER SEC Rate of partition leader elections. Depends on the number of partitions in the application.
OFFLINE PARTITIONS COUNT Total number of topic partitions, in the cluster, that are offline. Value > 0.
NETWORK PROCESSOR AVG IDLE PERCENT Average fraction of time the network processor threads are idle across the cluster. Two kinds of alert can be configured.
  • Alert-1: Value = 0, raises an alert when the network processor threads are busy.
  • Alert-2: Value > network_processor_idle_percentage, raises an alert when the network_processor_idle_percentage is higher than usual.
Note: Value range is 0 to 1.
REQUEST HANDLER POOL AVG IDLE PERCENT Average fraction of time the request handler threads are idle across the cluster. Two kinds of alert can be configured.
  • Alert-1: Value = 0, raises an alert when the request handler threads are busy.
  • Alert-2: Value > request_handler_idle_percentage, raises an alert when the request_handler_idle_percentage is higher than usual.
Note: Value range is 0 to 1.
BROKER BYTES IN DEVIATION PERCENTAGE Percentage by which a broker bytes in per second has deviated from the average bytes in per second of all the alive brokers. Value > max_byte_in_deviation_percentage, raises an alert if a broker is seeing more than max_byte_in_deviation_percentage incoming traffic compared to average incoming traffic seen by all the brokers.
BROKER BYTES OUT DEVIATION PERCENTAGE Percentage by which a broker bytes out per second has deviated from the average bytes out per second of all the alive brokers. Value > max_byte_out_deviation_percentage, raises an alert if a broker is seeing more than max_byte_out_deviation_percentage outgoing traffic compared to average outgoing traffic seen by all the brokers.
ZOOKEEPER SESSION EXPIRATION PER SEC Average rate at which brokers are experiencing zookeeper session expiration per second. If this value is high, it can lead to controller fail over and leader changes. Raises an alert if value > 0.
Consumer CONSUMER GROUP LAG How far consumer groups are behind the producers. Depends on the application.
IS CONSUMER ACTIVE Checks whether a consumer is active. Value is False.
MILLISECONDS LAPSED SINCE CONSUMER WAS ACTIVE Milliseconds passed since the consumer was last active. Value > max_consumer_idle_time, raises an alert if the consumer did not consume for max_consumer_idle_time ms.
Broker BYTES IN PER SEC Number of bytes per second produced to a broker. Two kinds of alert can be configured.
  • Alert-1: Value = 0, raises an alert when the broker becomes idle.
  • Alert-2: Value > max_bytes_in_expected_per_broker, raises an alert when the broker input load is higher than usual.
ZOOKEEPER SESSION EXPIRATION PER SEC Rate at which brokers are experiencing Zookeeper session expirations per second. If this value is high, it can lead to controller fail over and leader changes. Raises an alert if value > 0.
TOTAL PRODUCE REQUESTS PER SEC Total number of produce requests to a broker per second. Depends on the application. Two kinds of alert can be configured.
  • Alert-1: Value =0, raises an alert when there are no produce requests for the last 15 minutes.
  • Alert-2: Value > usual_num_producer_requests_expected to detect the spike in the number of requests.
PARTITION IMBALANCE PERCENTAGE The partition imbalance for a broker. It is calculated as: (abs(average_no_of_partitions_per_broker - actual_no_of_partitions_per_broker) / average_no_of_partitions_per_broker) * 100 Value > 10 %
BYTES OUT PER SEC Number of bytes per second fetched from a broker. It does not count the internal replication traffic. Two kinds of alert can be configured.
  • Alert-1: Value = 0, raises an alert when the broker becomes idle.
  • Alert-2: Value > max_bytes_out_expected_per_broker, raises an alert when the broker output load is higher than usual.
IS BROKER DOWN Checks whether a broker is down. Value is True.
TOTAL PRODUCE REQUEST LATENCY Latency of produce requests to this broker at the 99th percentile (in ms). Value > max_expected_latency_ms.
ISR SHRINKS PER SEC Rate at which brokers are experiencing InSync Replica Shrinks (number of shrinks per second). Value > 0.
TOTAL FETCH CONSUMER REQUEST LATENCY Latency of fetch consumer requests to this broker at 99th percentile (in ms). Value > max_expected_latency_ms.
REQUEST HANDLER POOL AVG IDLE PERCENT Average fraction of time the request handler threads are idle. Two kinds of alert can be configured.
  • Alert-1: Value = 0, raises an alert when the request handler threads are busy.
  • Alert-2: Value > request_handler_idle_percentage, raises an alert when the request_handler_idle_percentage is higher than usual.
Note: Value range is 0 to 1.
NETWORK PROCESSOR AVG IDLE PERCENT Average fraction of time the network processor threads are idle. Two kinds of alert can be configured.
  • Alert-1: Value = 0, raises an alert when the network processor threads are busy.
  • Alert-2: Value > network_processor_idle_percentage, raises an alert when the network_processor_idle_percentage is higher than usual.
Note: Value range is 0 to 1.
Cluster Replication REPLICATION LATENCY 15 minutes average replication latency in milliseconds. Value > max_expected_replication_latency, raises an alert if the replication latency is greater than max_expected_replication_latency.
REPLICATION THROUGHPUT 15 minutes average replication throughput in bytes per second. Value < min_expected_throughput, raises an alert if throughput during replication is low. This could happen because of network issues.
CHECKPOINT LATENCY 15 minutes average checkpoint latency in milliseconds. Value > max_expected_checkpoint_latency, raises an alert if the checkpoint latency is greater than max_expected_replication_latency.
REPLICATION STATUS Replication status of a replication pipeline. Value != ACTIVE, raises an alert if the replication is not active.
Latency END TO END LATENCY 15 minutes average of end to end latency in ms. Value > max_expected_latency, raises an alert if the end to end latency is greater than max_expected_latency.