Replication lag monitoring for Kudu

Learn how to monitor replication lag for Kudu by using Prometheus queries to verify that the enumerator is completing discovery cycles regularly.

Understanding replication lag

The primary goal of monitoring replication lag is to verify that the enumerator completes diff scan cycles at the expected intervals. In a steady state, the lag oscillates between 0 and approximately job.discoveryIntervalSeconds / 60 minutes, because the lag resets to zero each time a cycle ends and then grows until the next cycle completes. For example, if job.discoveryIntervalSeconds is set to 300, the lag peaks at about 5 minutes. This produces a sawtooth pattern in your monitoring tool.

If the lag value climbs continuously without resetting to zero, it indicates a stalled enumerator that requires investigation.

Prometheus queries for replication lag

You can use the following queries to monitor replication lag in Grafana or Prometheus:

Instantaneous lag: Use this query for a stat panel to see the current lag in minutes:
```
(time() - (flink_jobmanager_job_operator_coordinator_enumerator_lastEndTimestamp / 1000)) / 60
```

Lag as a labeled time series: Use the timestamp() function so the result keeps the series labels and can be plotted in a time series panel in Grafana:
```
(timestamp(flink_jobmanager_job_operator_coordinator_enumerator_lastEndTimestamp)
- (flink_jobmanager_job_operator_coordinator_enumerator_lastEndTimestamp / 1000)) / 60
```
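
The stalled-enumerator condition and the sawtooth peak described earlier can also be expressed as queries. The following is a minimal sketch, not a recommended configuration: it assumes job.discoveryIntervalSeconds is set to 300 (a healthy peak of about 5 minutes), and the 15-minute threshold, the 1h window, and the 1m resolution are illustrative values that you would adjust to your own interval.

Stalled enumerator check: This expression returns a series only while the lag exceeds the assumed threshold, so it can serve as the condition of an alert:
```
# Assumption: job.discoveryIntervalSeconds = 300, so three missed cycles is about 15 minutes.
(time() - (flink_jobmanager_job_operator_coordinator_enumerator_lastEndTimestamp / 1000)) / 60 > 15
```
Peak lag over a window: This subquery evaluates the lag once per minute over the last hour and returns the maximum, which in a steady state should stay close to job.discoveryIntervalSeconds / 60:
```
# Assumption: a 1h window with 1m resolution; the observed peak is only as precise as that resolution.
max_over_time(
  ((time() - (flink_jobmanager_job_operator_coordinator_enumerator_lastEndTimestamp / 1000)) / 60)[1h:1m]
)
```
A peak that keeps growing from one window to the next matches the continuously climbing lag described earlier, while a peak near the expected value suggests that cycles are completing on schedule.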