Monitoring throughput and row count consistency for Kudu replication
Learn how to use Flink and Kudu metrics to verify record flow, detect backpressure, and perform coarse consistency checks between source and sink clusters.
Flink operator throughput
The primary goal of monitoring operator throughput is to confirm that records flow through
the pipeline at the expected rate. The KuduSource output rate and the
KuduSink input rate must track each other closely. A growing gap between
these two metrics indicates backpressure or a slow writer.
Use the following Prometheus queries to validate the record flow:
- Records per second emitted by KuduSource:

  sum(flink_taskmanager_job_task_operator_numRecordsOutPerSecond{operator_name="Source:_KuduSource"})

- Records per second consumed by KuduSink:

  sum(flink_taskmanager_job_task_operator_numRecordsInPerSecond{operator_name="Sink:_Writer"})
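The gap check described above can be sketched as a small helper that compares the two rates returned by these queries. This is a minimal sketch, not part of the product: the 10% threshold and the example rate values are assumptions chosen for illustration.

```python
# Sketch: compare the KuduSource output rate with the KuduSink input rate
# (the values the two Prometheus queries above return) and flag a widening
# gap as likely backpressure. The 10% threshold is an assumed example value.

def relative_gap(source_rate: float, sink_rate: float) -> float:
    """Fraction of the source throughput that the sink is failing to absorb."""
    if source_rate <= 0:
        return 0.0
    return max(0.0, (source_rate - sink_rate) / source_rate)

def is_backpressured(source_rate: float, sink_rate: float,
                     threshold: float = 0.10) -> bool:
    """Flag a gap larger than the threshold as backpressure or a slow writer."""
    return relative_gap(source_rate, sink_rate) > threshold

# Example: source emits 5000 rec/s but the sink absorbs only 4200 rec/s,
# a 16% gap, which exceeds the assumed 10% threshold.
print(is_backpressured(5000, 4200))  # True
```

In practice you would feed this function rates fetched from the Prometheus HTTP API, averaged over a window, so that a short spike does not trigger a false alarm.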
Live row count consistency check
You can perform a coarse consistency check by comparing the approximate total row count on the source and sink tables. The sink row count must track the source count within one discovery interval. When you apply the cluster label in the scrape configuration, both clusters appear as separate series in a single query.
sum(
kudu_tablet_live_row_count
* on(tablet_id) group_left(table_name)
(max by (tablet_id, table_name) (kudu_tablet_info{table_name="<table-name>"}))
) by (cluster)

You must set the Grafana legend to {{cluster}}.
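The coarse consistency check can be expressed as a comparison of the per-cluster sums that this query returns. The sketch below is illustrative only: the cluster label values ("source", "sink"), the row-count tolerance, and the example counts are all assumptions.

```python
# Sketch: compare the per-cluster live row counts returned by the query
# above. Cluster label values and the 10,000-row tolerance are assumed
# example values; pick a tolerance that matches one discovery interval
# of write volume on your table.

def row_count_lag(counts_by_cluster: dict, source: str = "source",
                  sink: str = "sink") -> int:
    """Rows present on the source cluster that the sink has not yet applied."""
    return counts_by_cluster[source] - counts_by_cluster[sink]

def is_consistent(counts_by_cluster: dict, max_lag_rows: int = 10_000) -> bool:
    """Coarse check: the sink count must stay within max_lag_rows of the source."""
    return abs(row_count_lag(counts_by_cluster)) <= max_lag_rows

# Example values as they might be scraped from the two clusters.
counts = {"source": 1_250_000, "sink": 1_243_500}
print(row_count_lag(counts), is_consistent(counts))  # 6500 True
```

Because kudu_tablet_live_row_count is an approximate metric, treat a persistent lag beyond the tolerance as a signal to investigate, not as proof of data loss.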
Monitoring multiple tables
Each replicated table requires a dedicated replication job. When you monitor multiple tables, you must update the table_name filter in the Grafana queries to match the specific table.
The kudu_tablet_info mapping metric covers all tables on the tablet server automatically. You do not need to make changes to the json_exporter or Prometheus configuration when you add new tables.
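When many tables are replicated, maintaining the table_name filter by hand is error-prone. One hedged approach, sketched below, is to generate the per-table row-count query from a template; the table names shown are hypothetical.

```python
# Sketch: generate the per-cluster row-count query for each replicated
# table by substituting the table_name filter. The table list is a
# hypothetical example; the query text mirrors the one shown earlier.

ROW_COUNT_QUERY = (
    'sum(kudu_tablet_live_row_count'
    ' * on(tablet_id) group_left(table_name)'
    ' (max by (tablet_id, table_name)'
    ' (kudu_tablet_info{table_name="%s"}))) by (cluster)'
)

def query_for_table(table_name: str) -> str:
    """Return the live row count query scoped to a single table."""
    return ROW_COUNT_QUERY % table_name

for table in ("orders", "customers"):  # hypothetical replicated tables
    print(query_for_table(table))
```

In Grafana, the same effect can be achieved with a dashboard variable in place of the hard-coded table name, so one panel serves every replicated table.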
