Monitoring the replication job

Learn how to visualize the health of the replication pipeline by configuring Prometheus and Grafana.

Configuring Prometheus

Enable Prometheus metrics reporting in Flink by adding the following to the conf/config.yaml file:

metrics:
  reporter.prom.factory.class: org.apache.flink.metrics.prometheus.PrometheusReporterFactory
  reporter.prom.port: 9250-9260
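
With a port range, each Flink process on a host typically binds the first free port in that range, so Prometheus needs a scrape target for every port that might be in use. The following Python sketch expands the range into a list of host:port targets. It is only an illustration: the function names and the host name are hypothetical, and it assumes you list targets statically rather than through service discovery.

```python
def expand_port_range(spec: str) -> list[int]:
    """Expand a port spec such as "9250-9260" or "9249" into a list of ports."""
    first, _, last = spec.partition("-")
    start = int(first)
    end = int(last) if last else start
    if end < start:
        raise ValueError(f"invalid port range: {spec!r}")
    return list(range(start, end + 1))


def scrape_targets(hosts: list[str], port_spec: str) -> list[str]:
    """Return one host:port target for every port a reporter might bind."""
    return [
        f"{host}:{port}"
        for host in hosts
        for port in expand_port_range(port_spec)
    ]


if __name__ == "__main__":
    # Hypothetical host name; replace with the hosts that run Flink processes.
    for target in scrape_targets(["flink-worker-1.example.com"], "9250-9260"):
        print(target)
```

Ports in the range that no Flink process has bound appear as down targets in Prometheus, which is expected with this approach.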

Normalize Kudu tablet server metrics by using Prometheus metric relabeling and the json_exporter tool. This ensures metrics are stable and queryable by table name.
Use the provided scrape configurations to add a cluster label (such as source or sink) to all Kudu targets.
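
To illustrate what normalizing by table name achieves, the following Python sketch sums per-tablet values into one series per cluster, table, and metric name. It is not the provided configuration. It assumes the tablet server metrics JSON is a list of entities, where tablet entities carry a table_name attribute and a list of name/value metrics. The sample data, the table names, and the per_table_totals function are hypothetical.

```python
from collections import defaultdict

# Assumed, trimmed-down shape of a tablet server metrics response.
SAMPLE = [
    {"type": "tablet", "id": "tablet-a",
     "attributes": {"table_name": "orders"},
     "metrics": [{"name": "rows_inserted", "value": 120}]},
    {"type": "tablet", "id": "tablet-b",
     "attributes": {"table_name": "orders"},
     "metrics": [{"name": "rows_inserted", "value": 30}]},
    {"type": "tablet", "id": "tablet-c",
     "attributes": {"table_name": "customers"},
     "metrics": [{"name": "rows_inserted", "value": 7}]},
    {"type": "server", "id": "kudu.tabletserver",
     "attributes": {}, "metrics": []},
]


def per_table_totals(entities: list[dict], cluster: str) -> dict:
    """Sum tablet-level values per (cluster, table name, metric name)."""
    totals = defaultdict(int)
    for entity in entities:
        if entity.get("type") != "tablet":
            continue  # skip server-level entities
        table = entity.get("attributes", {}).get("table_name")
        if table is None:
            continue
        for metric in entity.get("metrics", []):
            value = metric.get("value")
            # Metrics without a plain numeric value (for example histograms)
            # are ignored in this sketch.
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                totals[(cluster, table, metric["name"])] += value
    return dict(totals)


if __name__ == "__main__":
    for key, total in sorted(per_table_totals(SAMPLE, "source").items()):
        print(key, total)
```

Keying the result by table name rather than tablet ID means a query keeps working when tablets are added or moved, and the cluster label separates source values from sink values.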

Configuring Grafana

Import the provided Grafana dashboard to monitor the following areas:

Job health: Uptime, restart counts, and checkpoint durations.
Replication lag: The difference between the current time and the lastEndTimestamp.
Write activity: The rate of each write operation type on the source and sink clusters.
Throughput: The number of records per second emitted by the source and consumed by the sink.
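
The replication lag and throughput figures above reduce to simple arithmetic. The following Python sketch shows both calculations. It assumes that lastEndTimestamp is expressed in epoch milliseconds and that record counts are monotonically increasing counters. Neither assumption is confirmed here, and the function names are hypothetical.

```python
import time
from typing import Optional


def replication_lag_seconds(last_end_timestamp_ms: float,
                            now_ms: Optional[float] = None) -> float:
    """Return the current time minus lastEndTimestamp, in seconds.

    Assumes lastEndTimestamp is in epoch milliseconds.
    """
    if now_ms is None:
        now_ms = time.time() * 1000
    # Clamp at zero so that small clock differences do not yield a negative lag.
    return max(0.0, (now_ms - last_end_timestamp_ms) / 1000.0)


def records_per_second(count_start: float, count_end: float,
                       interval_seconds: float) -> float:
    """Return the average record rate between two counter samples."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta = count_end - count_start
    if delta < 0:
        # The counter went backwards, which suggests a restart; count from zero.
        delta = count_end
    return delta / interval_seconds


if __name__ == "__main__":
    print(replication_lag_seconds(1_000_000, now_ms=1_030_000))
    print(records_per_second(100, 700, 60))
```

A lag that keeps growing while the source throughput stays above zero suggests that the sink is not keeping up.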