Component types and metrics for alert policies
You create an alert policy for a component type. The component type determines which metrics you can select when you create a threshold. Learn about the component types and the metrics supported for each.
The following table lists the component types and metrics for an alert policy:
Metric | Description | Suggested Alert |
---|---|---|
Topic | ||
UNDER REPLICATED PARTITIONS COUNT | Total number of partitions that are under replicated for a topic. | Value > 0. |
BYTES IN PER SEC | Bytes per second coming in to a topic. | Two kinds of alert can be configured. |
BYTES OUT PER SEC | Bytes per second going out from a topic. It does not count the internal replication traffic. | Two kinds of alert can be configured. |
OUT OF SYNC REPLICA COUNT | Total number of replicas that are not in sync with the leader for a topic. | Value > 0, raises an alert if there are out of sync replicas for the topic. |
TOPIC PARTITION CONSUMPTION PERCENTAGE | Percentage of bytes consumed per topic partition compared according to the configured parameter. | Value > max_expected_value, raises an alert if the topic partition reaches a certain consumption percentage. |
TOPIC PARTITION BYTES IN PER SEC | Bytes per second coming in to a topic partition. | Two kinds of alert can be configured. |
TOPIC PARTITION BYTES OUT PER SEC | Bytes per second coming out of a topic partition. | Two kinds of alert can be configured. |
Producer | ||
IS PRODUCER ACTIVE | Checks whether a producer is active. | Value is False. |
MILLISECONDS LAPSED SINCE PRODUCER WAS ACTIVE | Milliseconds passed since the producer was last active. | Value > max_producer_idle_time, raises an alert if the producer did not produce for max_producer_idle_time ms. |
Cluster | ||
ACTIVE CONTROLLER COUNT | Number of brokers in the cluster reporting as the active controller in the last interval. | Value != 1. |
ONLINE BROKER COUNT | Number of brokers that are currently online. | Depends on the application. For example, you can raise an alert if the number of brokers falls below the min.insync.replicas configured for the producer. |
UNCLEAN LEADER ELECTION COUNT | Number of unclean partition leader elections in the cluster reported in the last interval. | Value > 0. |
UNDER REPLICATED PARTITIONS COUNT | Total number of topic partitions in the cluster that are under replicated. | Value > 0. |
LEADER ELECTION PER SEC | Rate of partition leader elections. | Depends on the number of partitions in the application. |
OFFLINE PARTITIONS COUNT | Total number of topic partitions, in the cluster, that are offline. | Value > 0. |
NETWORK PROCESSOR AVG IDLE PERCENT | Average fraction of time the network processor threads are idle across the cluster. | Two kinds of alert can be configured. |
REQUEST HANDLER POOL AVG IDLE PERCENT | Average fraction of time the request handler threads are idle across the cluster. | Two kinds of alert can be configured. |
BROKER BYTES IN DEVIATION PERCENTAGE | Percentage by which a broker bytes in per second has deviated from the average bytes in per second of all the alive brokers. | Value > max_byte_in_deviation_percentage, raises an alert if a broker is seeing more than max_byte_in_deviation_percentage incoming traffic compared to average incoming traffic seen by all the brokers. |
BROKER BYTES OUT DEVIATION PERCENTAGE | Percentage by which a broker bytes out per second has deviated from the average bytes out per second of all the alive brokers. | Value > max_byte_out_deviation_percentage, raises an alert if a broker is seeing more than max_byte_out_deviation_percentage outgoing traffic compared to average outgoing traffic seen by all the brokers. |
ZOOKEEPER SESSION EXPIRATION PER SEC | Average rate at which brokers are experiencing ZooKeeper session expirations per second. | If this value is high, it can lead to controller failover and leader changes. Raises an alert if value > 0. |
Consumer | ||
CONSUMER GROUP LAG | How far consumer groups are behind the producers. | Depends on the application. |
IS CONSUMER ACTIVE | Checks whether a consumer is active. | Value is False. |
MILLISECONDS LAPSED SINCE CONSUMER WAS ACTIVE | Milliseconds passed since the consumer was last active. | Value > max_consumer_idle_time, raises an alert if the consumer did not consume for max_consumer_idle_time ms. |
Broker | ||
BYTES IN PER SEC | Number of bytes per second produced to a broker. | Two kinds of alert can be configured. |
ZOOKEEPER SESSION EXPIRATION PER SEC | Rate at which brokers are experiencing ZooKeeper session expirations per second. | If this value is high, it can lead to controller failover and leader changes. Raises an alert if value > 0. |
TOTAL PRODUCE REQUESTS PER SEC | Total number of produce requests to a broker per second. | Depends on the application. Two kinds of alert can be configured. |
PARTITION IMBALANCE PERCENTAGE | The partition imbalance for a broker. It is calculated as: (abs(average_no_of_partitions_per_broker - actual_no_of_partitions_per_broker) / average_no_of_partitions_per_broker) * 100 | Value > 10%. |
BYTES OUT PER SEC | Number of bytes per second fetched from a broker. It does not count the internal replication traffic. | Two kinds of alert can be configured. |
IS BROKER DOWN | Checks whether a broker is down. | Value is True. |
TOTAL PRODUCE REQUEST LATENCY | Latency of produce requests to this broker at the 99th percentile (in ms). | Value > max_expected_latency_ms. |
ISR SHRINKS PER SEC | Rate at which brokers are experiencing in-sync replica (ISR) shrinks (number of shrinks per second). | Value > 0. |
TOTAL FETCH CONSUMER REQUEST LATENCY | Latency of fetch consumer requests to this broker at the 99th percentile (in ms). | Value > max_expected_latency_ms. |
REQUEST HANDLER POOL AVG IDLE PERCENT | Average fraction of time the request handler threads are idle. | Two kinds of alert can be configured. |
NETWORK PROCESSOR AVG IDLE PERCENT | Average fraction of time the network processor threads are idle. | Two kinds of alert can be configured. |
Cluster Replication | ||
REPLICATION LATENCY | 15-minute average replication latency in milliseconds. | Value > max_expected_replication_latency, raises an alert if the replication latency is greater than max_expected_replication_latency. |
REPLICATION THROUGHPUT | 15-minute average replication throughput in bytes per second. | Value < min_expected_throughput, raises an alert if throughput during replication is low. This could happen because of network issues. |
CHECKPOINT LATENCY | 15-minute average checkpoint latency in milliseconds. | Value > max_expected_checkpoint_latency, raises an alert if the checkpoint latency is greater than max_expected_checkpoint_latency. |
REPLICATION STATUS | Replication status of a replication pipeline. | Value != ACTIVE, raises an alert if the replication is not active. |
Latency | ||
END TO END LATENCY | 15-minute average of end-to-end latency in ms. | Value > max_expected_latency, raises an alert if the end-to-end latency is greater than max_expected_latency. |
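
The PARTITION IMBALANCE PERCENTAGE formula in the Broker section can be checked by hand with a short script. The following is a minimal sketch in Python, not the product's implementation. The function name, the broker IDs, and the partition counts are hypothetical, and the sketch assumes that the average is taken over all brokers in the supplied mapping.

```python
def partition_imbalance_percentage(partitions_per_broker, broker_id):
    """Apply the table's formula to one broker.

    (abs(average - actual) / average) * 100
    """
    # Assumption: the average is computed over every broker in the mapping.
    average = sum(partitions_per_broker.values()) / len(partitions_per_broker)
    if average == 0:
        # Assumption: with no partitions at all, report no imbalance.
        return 0.0
    actual = partitions_per_broker[broker_id]
    return abs(average - actual) / average * 100


# Hypothetical partition counts keyed by broker ID.
counts = {"broker-1": 120, "broker-2": 100, "broker-3": 80}

# Suggested alert from the table: Value > 10%.
SUGGESTED_THRESHOLD = 10

# Collect the brokers whose imbalance exceeds the suggested threshold.
imbalanced = {}
for broker in counts:
    value = partition_imbalance_percentage(counts, broker)
    if value > SUGGESTED_THRESHOLD:
        imbalanced[broker] = round(value, 1)

print(imbalanced)
```

With these sample counts the average is 100 partitions per broker, so broker-1 and broker-3 each deviate by about 20% and exceed the suggested threshold, while broker-2 does not.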