Component Types and Metrics for Alert Policies
You create an alert policy for a component type. The component type determines which metrics you can select when you create a threshold.
The following table lists the component types and metrics for an alert policy:
Component Type | Metric | Description | Suggested Alert |
---|---|---|---|
Topic | UNDER REPLICATED PARTITIONS COUNT | Total number of under-replicated partitions for a topic. | Value > 0. |
Topic | BYTES IN PER SEC | Bytes per second coming in to a topic. | Two kinds of alert can be configured. |
Topic | BYTES OUT PER SEC | Bytes per second going out from a topic. Internal replication traffic is not counted. | Two kinds of alert can be configured. |
Topic | OUT OF SYNC REPLICA COUNT | Total number of replicas that are not in sync with the leader for a topic. | Value > 0. Raises an alert if there are out-of-sync replicas for the topic. |
Topic | TOPIC PARTITION CONSUMPTION PERCENTAGE | Percentage of bytes consumed per topic partition, compared against the configured parameter. | Value > max_expected_value. Raises an alert if the topic partition reaches a certain consumption percentage. |
Topic | TOPIC PARTITION BYTES IN PER SEC | Bytes per second coming in to a topic partition. | Two kinds of alert can be configured. |
Topic | TOPIC PARTITION BYTES OUT PER SEC | Bytes per second going out from a topic partition. | Two kinds of alert can be configured. |
Producer | IS PRODUCER ACTIVE | Checks whether a producer is active. | Value is False. |
Producer | MILLISECONDS LAPSED SINCE PRODUCER WAS ACTIVE | Milliseconds passed since the producer was last active. | Value > max_producer_idle_time. Raises an alert if the producer did not produce for max_producer_idle_time ms. |
Cluster | ACTIVE CONTROLLER COUNT | Number of brokers in the cluster reporting as the active controller in the last interval. | Value != 1. |
Cluster | ONLINE BROKER COUNT | Number of brokers that are currently online. | Depends on the application. For example, you can raise an alert if the number of brokers falls below the min.insync.replicas configured for the producer. |
Cluster | UNCLEAN LEADER ELECTION COUNT | Number of unclean partition leader elections in the cluster reported in the last interval. | Value > 0. |
Cluster | UNDER REPLICATED PARTITIONS COUNT | Total number of topic partitions in the cluster that are under-replicated. | Value > 0. |
Cluster | LEADER ELECTION PER SEC | Rate of partition leader elections. | Depends on the number of partitions in the application. |
Cluster | OFFLINE PARTITIONS COUNT | Total number of topic partitions in the cluster that are offline. | Value > 0. |
Cluster | NETWORK PROCESSOR AVG IDLE PERCENT | Average fraction of time the network processor threads are idle across the cluster. | Two kinds of alert can be configured. |
Cluster | REQUEST HANDLER POOL AVG IDLE PERCENT | Average fraction of time the request handler threads are idle across the cluster. | Two kinds of alert can be configured. |
Cluster | BROKER BYTES IN DEVIATION PERCENTAGE | Percentage by which a broker's bytes in per second deviates from the average bytes in per second of all the alive brokers. | Value > max_byte_in_deviation_percentage. Raises an alert if a broker is seeing more than max_byte_in_deviation_percentage incoming traffic compared to the average incoming traffic seen by all the brokers. |
Cluster | BROKER BYTES OUT DEVIATION PERCENTAGE | Percentage by which a broker's bytes out per second deviates from the average bytes out per second of all the alive brokers. | Value > max_byte_out_deviation_percentage. Raises an alert if a broker is seeing more than max_byte_out_deviation_percentage outgoing traffic compared to the average outgoing traffic seen by all the brokers. |
Cluster | ZOOKEEPER SESSION EXPIRATION PER SEC | Average rate, per second, at which brokers experience ZooKeeper session expirations. | A high value can lead to controller failover and leader changes. Raises an alert if value > 0. |
Consumer | CONSUMER GROUP LAG | How far consumer groups are behind the producers. | Depends on the application. |
Consumer | IS CONSUMER ACTIVE | Checks whether a consumer is active. | Value is False. |
Consumer | MILLISECONDS LAPSED SINCE CONSUMER WAS ACTIVE | Milliseconds passed since the consumer was last active. | Value > max_consumer_idle_time. Raises an alert if the consumer did not consume for max_consumer_idle_time ms. |
Broker | BYTES IN PER SEC | Number of bytes per second produced to a broker. | Two kinds of alert can be configured. |
Broker | ZOOKEEPER SESSION EXPIRATION PER SEC | Rate, per second, at which brokers experience ZooKeeper session expirations. | A high value can lead to controller failover and leader changes. Raises an alert if value > 0. |
Broker | TOTAL PRODUCE REQUESTS PER SEC | Total number of produce requests to a broker per second. | Depends on the application. Two kinds of alert can be configured. |
Broker | PARTITION IMBALANCE PERCENTAGE | The partition imbalance for a broker. It is calculated as: (abs(average_no_of_partitions_per_broker - actual_no_of_partitions_per_broker) / average_no_of_partitions_per_broker) * 100 | Value > 10%. |
Broker | BYTES OUT PER SEC | Number of bytes per second fetched from a broker. Internal replication traffic is not counted. | Two kinds of alert can be configured. |
Broker | IS BROKER DOWN | Checks whether a broker is down. | Value is True. |
Broker | TOTAL PRODUCE REQUEST LATENCY | Latency of produce requests to this broker at the 99th percentile (in ms). | Value > max_expected_latency_ms. |
Broker | ISR SHRINKS PER SEC | Rate at which brokers experience in-sync replica (ISR) shrinks (number of shrinks per second). | Value > 0. |
Broker | TOTAL FETCH CONSUMER REQUEST LATENCY | Latency of fetch consumer requests to this broker at the 99th percentile (in ms). | Value > max_expected_latency_ms. |
Broker | REQUEST HANDLER POOL AVG IDLE PERCENT | Average fraction of time the request handler threads are idle. | Two kinds of alert can be configured. |
Broker | NETWORK PROCESSOR AVG IDLE PERCENT | Average fraction of time the network processor threads are idle. | Two kinds of alert can be configured. |
Cluster Replication | REPLICATION LATENCY | 15-minute average replication latency in milliseconds. | Value > max_expected_replication_latency. Raises an alert if the replication latency is greater than max_expected_replication_latency. |
Cluster Replication | REPLICATION THROUGHPUT | 15-minute average replication throughput in bytes per second. | Value < min_expected_throughput. Raises an alert if throughput during replication is low, which can happen because of network issues. |
Cluster Replication | CHECKPOINT LATENCY | 15-minute average checkpoint latency in milliseconds. | Value > max_expected_checkpoint_latency. Raises an alert if the checkpoint latency is greater than max_expected_checkpoint_latency. |
Cluster Replication | REPLICATION STATUS | Replication status of a replication pipeline. | Value != ACTIVE. Raises an alert if the replication is not active. |
Latency | END TO END LATENCY | 15-minute average of end-to-end latency in ms. | Value > max_expected_latency. Raises an alert if the end-to-end latency is greater than max_expected_latency. |
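
The PARTITION IMBALANCE PERCENTAGE formula in the table can be checked by hand with a short calculation. The following Python sketch is illustrative only: the function names, the broker names, and the shape of the input dictionary are assumptions made for this example and are not part of the product. It applies the formula to each broker and keeps the brokers that exceed the suggested 10% threshold.

```python
def partition_imbalance_percentage(actual_partitions, average_partitions):
    """Apply the table's formula: (abs(average - actual) / average) * 100."""
    if average_partitions <= 0:
        raise ValueError("average_partitions must be positive")
    return abs(average_partitions - actual_partitions) / average_partitions * 100


def imbalanced_brokers(partitions_per_broker, threshold_percent=10.0):
    """Return {broker: imbalance %} for brokers above the threshold.

    partitions_per_broker is a hypothetical mapping of broker name to
    partition count; the 10.0 default mirrors the suggested alert value.
    """
    if not partitions_per_broker:
        return {}
    average = sum(partitions_per_broker.values()) / len(partitions_per_broker)
    result = {}
    for broker, count in partitions_per_broker.items():
        imbalance = partition_imbalance_percentage(count, average)
        if imbalance > threshold_percent:
            result[broker] = imbalance
    return result


if __name__ == "__main__":
    # Hypothetical partition counts for three brokers (average is 110).
    counts = {"broker-1": 100, "broker-2": 100, "broker-3": 130}
    for broker, imbalance in imbalanced_brokers(counts).items():
        print(f"{broker}: {imbalance:.2f}% imbalance")
```

With these sample counts the average is 110 partitions per broker, so broker-1 and broker-2 sit at roughly 9.09% and stay under the threshold, while broker-3 is at roughly 18.18% and would meet the suggested alert condition.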