Grafana dashboard for Kudu replication

Learn about the ready-to-import Grafana dashboard for Kudu replication and how to use specific Prometheus queries to monitor Flink job health.

Accessing the replication dashboard

A ready-to-import Grafana dashboard is available in the Kudu repository at the following location:

examples/flink-replication/monitoring/grafana/dashboards/replication.json

The queries described in this topic show what each panel displays and are working examples taken from that dashboard. They assume that your Prometheus scrape configuration attaches cluster labels to Kudu targets. Other ways to express these queries exist, so adapt the label selectors and metric names to your environment.
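
Adapting a label selector can be sketched as follows. This minimal Python helper is not part of the dashboard; it only appends exact-match label matchers to a metric name. The label names `cluster` and `job_name` and their values are assumptions for illustration, so substitute the labels that your scrape configuration actually attaches.

```python
def with_labels(metric: str, labels: dict[str, str]) -> str:
    """Return a PromQL selector for `metric` filtered by exact-match labels."""
    if not labels:
        return metric
    matchers = []
    for name, value in sorted(labels.items()):
        # Escape backslashes and double quotes so the value stays a valid
        # PromQL string literal.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        matchers.append(f'{name}="{escaped}"')
    return metric + "{" + ",".join(matchers) + "}"


# Hypothetical label values; replace them with the ones in your environment.
selector = with_labels(
    "flink_jobmanager_job_uptime",
    {"cluster": "prod", "job_name": "kudu-replication"},
)
```

The resulting string can be pasted into a Grafana panel or the Prometheus expression browser. Matchers are sorted by label name only to make the output deterministic.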

Monitoring job health

The primary goal of monitoring job health is to confirm that the Flink job is running, has not restarted unexpectedly, and is performing checkpoints normally.

Table 1. Prometheus queries for monitoring replication job health
Goal	Prometheus query	Description
Job uptime	`flink_jobmanager_job_uptime`	Measures the time since the job last started or restarted.
Restart count	`flink_jobmanager_job_fullRestarts`	Tracks the number of full restarts. In a steady state, this value should remain 0; any increase indicates that the job has restarted.
Checkpoint duration	`flink_jobmanager_job_lastCheckpointDuration`	Measures the duration of the most recent checkpoint.
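
Outside Grafana, the same three metrics can be read through the Prometheus HTTP API (`/api/v1/query`). The following is a minimal sketch under stated assumptions, not the dashboard's own logic: the base URL, the 60-second checkpoint threshold, and the helper names are hypothetical, and the `_ms` suffixes assume that the Flink values are reported in milliseconds.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Metric names from Table 1, keyed by a short hypothetical alias.
HEALTH_QUERIES = {
    "uptime_ms": "flink_jobmanager_job_uptime",
    "full_restarts": "flink_jobmanager_job_fullRestarts",
    "last_checkpoint_ms": "flink_jobmanager_job_lastCheckpointDuration",
}


def query_url(base_url, promql):
    """Build the instant-query URL for a PromQL expression."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def first_value(payload):
    """Return the first sample of an instant-vector response, or None if empty."""
    if payload.get("status") != "success":
        raise ValueError(f"Prometheus query failed: {payload.get('error')}")
    result = payload["data"]["result"]
    if not result:
        return None
    # Each sample is [unix_timestamp, "value-as-string"].
    return float(result[0]["value"][1])


def fetch(base_url, promql, timeout=5.0):
    """Run one instant query against Prometheus and return its first value."""
    with urlopen(query_url(base_url, promql), timeout=timeout) as response:
        return first_value(json.load(response))


def check_health(values, max_checkpoint_ms=60_000.0):
    """Return a list of human-readable problems; an empty list means healthy."""
    problems = []
    for name in HEALTH_QUERIES:
        if values.get(name) is None:
            problems.append(f"{name}: no data returned")
    restarts = values.get("full_restarts")
    if restarts is not None and restarts > 0:
        problems.append(f"full_restarts: job has restarted {int(restarts)} time(s)")
    checkpoint = values.get("last_checkpoint_ms")
    # The threshold is an arbitrary example value, not a recommendation.
    if checkpoint is not None and checkpoint > max_checkpoint_ms:
        problems.append(f"last_checkpoint_ms: {checkpoint:.0f} exceeds {max_checkpoint_ms:.0f}")
    return problems
```

To use it, collect the values with something like `{name: fetch("http://localhost:9090", q) for name, q in HEALTH_QUERIES.items()}` and pass the result to `check_health`. If several jobs report to the same Prometheus instance, narrow each query with a label selector first, because `first_value` reads only the first series returned.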