Replication monitoring and diagnostics

If you have already installed Prometheus and Grafana, you can use them to monitor your replication flows. When you configure Kafka cluster replication, the replication connectors provide additional metrics that are worth monitoring alongside the metrics of the underlying Kafka Connect cluster.

For the complete list of metrics related to replication connectors, see Monitoring Geo-Replication in the Apache Kafka documentation. To access these metrics, you must configure the Connect JMX metrics exporter.

You can use the included kafka-connect-replication-metrics.yaml example file to create a Kafka Connect cluster that exports the necessary metrics. This example exports both replication-related metrics and metrics about the underlying Kafka Connect cluster, which can be useful when monitoring replication flows.

Before applying the example file, set spec.bootstrapServers so that it points to your target Kafka cluster. After you deploy the replication connectors into this Kafka Connect cluster, the replication connector metrics are available with the kafka_connect_mirror_ prefix. You can change the prefix by specifying different renaming rules in the JMX exporter configuration.
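One way to confirm that the metrics are exported is to ask Prometheus for every series whose name starts with the prefix. The following is a minimal sketch, assuming Prometheus is reachable at a URL you supply (the address in the usage note is hypothetical) and that the default kafka_connect_mirror_ prefix is in use. It relies only on the standard Prometheus HTTP query endpoint, /api/v1/query; the function names are illustrative.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Default prefix from the example configuration; change it if you renamed the metrics.
PREFIX = "kafka_connect_mirror_"


def build_query_url(prometheus_url, prefix=PREFIX):
    """Build an instant-query URL that selects every series whose name starts with prefix."""
    selector = '{__name__=~"' + prefix + '.+"}'
    return prometheus_url.rstrip("/") + "/api/v1/query?" + urlencode({"query": selector})


def metric_names(response_body):
    """Extract the distinct metric names from a Prometheus instant-query JSON response."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise RuntimeError("Prometheus query failed: " + str(payload.get("error")))
    return sorted({series["metric"]["__name__"] for series in payload["data"]["result"]})


def list_replication_metrics(prometheus_url):
    """Query Prometheus (network call) and return the exported replication metric names."""
    with urlopen(build_query_url(prometheus_url), timeout=10) as response:
        return metric_names(response.read().decode("utf-8"))
```

For example, list_replication_metrics("http://prometheus.example:9090") would return the replication metric names that Prometheus currently knows about. An empty list suggests that the connectors are not yet deployed, the exporter is not configured, or the prefix was changed.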

The following metrics can be of interest when monitoring replication:

kafka_connect_mirror_mirrorsourceconnector_byte_rate – The rate, in bytes per second, at which records are replicated through the source connector.
kafka_connect_mirror_mirrorsourceconnector_record_age_ms – The time between the record timestamp in the source topic and the moment the source connector handles the record.
kafka_connect_mirror_mirrorsourceconnector_replication_latency_ms – The time it takes records to propagate from source to target. This is the difference between the record timestamp in the source topic and the time when the producer receives the acknowledgment from the target cluster that the record was written successfully.
kafka_connect_source_task_source_record_active_count – The number of records that this task has consumed from the source but not yet produced to the target.
kafka_connect_connector_task_offset_commit_avg_time_ms – The average time this task takes to commit its offsets to the target.
kafka_consumer_fetch_manager_records_lag – The consumer lag. In the context of replication, it indicates whether the consumer in the source connector can keep up with the rate at which records are produced in the source cluster.
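The difference between record age and replication latency can be illustrated with a small calculation. This is a sketch only: the RecordTimes type, its field names, and the timestamps are hypothetical and chosen for illustration, and the arithmetic simply follows the metric descriptions above.

```python
from dataclasses import dataclass


@dataclass
class RecordTimes:
    """Hypothetical timestamps, in epoch milliseconds, for one replicated record."""
    record_timestamp_ms: int  # timestamp of the record in the source topic
    handled_ms: int           # when the source connector handles the record
    acked_ms: int             # when the producer receives the ack from the target cluster


def record_age_ms(times):
    """Record age: source record timestamp until the source connector handles the record."""
    return times.handled_ms - times.record_timestamp_ms


def replication_latency_ms(times):
    """Replication latency: source record timestamp until the target cluster acknowledges the write."""
    return times.acked_ms - times.record_timestamp_ms


# Illustrative values only.
example = RecordTimes(record_timestamp_ms=1_000, handled_ms=1_250, acked_ms=1_400)
print(record_age_ms(example))           # 250
print(replication_latency_ms(example))  # 400
```

Because the connector handles a record before the target cluster acknowledges it, the replication latency of a record is at least as large as its record age. A growing gap between the two points at the write path to the target cluster rather than at consumption from the source.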

A sample Grafana dashboard that visualizes the above metrics is provided among the examples in strimzi-kafka-connect-replication.json. It can serve as a basis for monitoring replication flows. Because you can select the namespace and the Kafka Connect cluster that you want to monitor, you can use the same dashboard for multiple replication flows. You can tailor the dashboard to your specific needs by modifying or extending it.

The prometheus-rules.yaml file contains replication-related alerting rules under the replication group. Adjust the thresholds to your specific needs, or define your own rules. It is also recommended to configure the alerting rules for Kafka Connect (the connect group).