Failover and monitoring for Kudu replication
Learn how to perform a disaster recovery cutover, monitor replication health by using Flink metrics, and configure Prometheus and Grafana for visualization.
When you use a replicated sink cluster as a disaster recovery (DR) target, you must follow these procedures to promote the sink cluster to active status.
Planned failover (Source cluster available)
Use this procedure if the source cluster is reachable and you must cut over with minimal data loss.
- Follow the Pre-stop checklist to drain the pipeline.
- Stop the replication job by using a savepoint.
- Enable write access on the sink cluster by updating the Ranger policy.
- Redirect application traffic to the sink cluster by updating DNS records or connection strings.
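The stop-with-savepoint step can be scripted. The following is a minimal sketch, assuming the standard Flink CLI is on the `PATH` and that you already know the Flink job ID. The function names, the job ID, and the savepoint directory are hypothetical placeholders, and any YARN-specific targeting options your deployment needs are not shown.

```python
import subprocess
from typing import List


def build_stop_command(job_id: str, savepoint_dir: str, flink_bin: str = "flink") -> List[str]:
    """Build the argument list for `flink stop --savepointPath <dir> <jobId>`.

    job_id and savepoint_dir are supplied by the operator; this helper only
    assembles the command and does not contact the cluster.
    """
    if not job_id or not savepoint_dir:
        raise ValueError("job_id and savepoint_dir are both required")
    return [flink_bin, "stop", "--savepointPath", savepoint_dir, job_id]


def stop_with_savepoint(job_id: str, savepoint_dir: str) -> None:
    """Run the stop command and raise CalledProcessError if the CLI exits non-zero."""
    subprocess.run(build_stop_command(job_id, savepoint_dir), check=True)
```

Keeping command construction separate from execution makes it possible to review or log the exact command before a cutover.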
Unplanned failover (Source cluster unavailable)
Use this procedure if the source cluster fails unexpectedly.
- Assess the replication lag at the time of failure by checking the `lastEndTimestamp` value in Grafana or Prometheus. The data loss window equals the replication lag at the time of failure.
- Stop the replication job. If the job is unreachable, kill the YARN application by using the ResourceManager UI or the `yarn application -kill` command.
- Enable write access on the sink cluster by updating the Ranger policy.
- Redirect application traffic to the sink cluster.
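The data loss estimate in the first step is a simple subtraction. The sketch below assumes that `lastEndTimestamp` is expressed in milliseconds since the Unix epoch; that unit is an assumption for illustration, so confirm it against your deployment before relying on the result. The function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def data_loss_window(last_end_timestamp_ms: int, failure_time: datetime) -> timedelta:
    """Estimate the data loss window as failure time minus the last replicated timestamp.

    last_end_timestamp_ms: last observed lastEndTimestamp value
        (ASSUMED to be epoch milliseconds).
    failure_time: timezone-aware time at which the source cluster failed.
    """
    if failure_time.tzinfo is None:
        raise ValueError("failure_time must be timezone-aware")
    last_end = EPOCH + timedelta(milliseconds=last_end_timestamp_ms)
    lag = failure_time - last_end
    # Clock skew can make the lag slightly negative; clamp it to zero.
    return max(lag, timedelta(0))
```

For example, if the last recorded value corresponds to 12:00:00 UTC and the source failed at 12:01:30 UTC, the function returns a window of 90 seconds.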
Kudu replication metrics reference
The replication job exposes custom metrics through the Flink metrics system. In a steady state, all split metrics (`pendingCount`, `unassignedCount`, and `pendingRemovalCount`) must be zero. A continuously advancing `lastEndTimestamp` indicates healthy replication.
| Metric Name | Type | Description |
|---|---|---|
| `lastEndTimestamp` | Gauge (Long) | The end timestamp of the most recently completed diff scan. |
| `pendingCount` | Gauge (Integer) | The number of scan splits assigned to readers but not yet fully processed. |
| `unassignedCount` | Gauge (Integer) | The number of scan splits waiting for assignment to a reader. |
| `pendingRemovalCount` | Gauge (Integer) | The number of completed splits deferred for removal until the next Flink checkpoint completes. |
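The health rules described above (split metrics at zero, `lastEndTimestamp` advancing) can be expressed as a small check over two consecutive metric samples. This is an illustrative sketch only: the `ReplicationSample` class and `assess` function are hypothetical, and how you collect the samples (for example, from Prometheus) and the exact exported metric names depend on your reporter configuration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ReplicationSample:
    """One reading of the four replication gauges."""
    last_end_timestamp: int
    pending_count: int
    unassigned_count: int
    pending_removal_count: int


def assess(previous: ReplicationSample, current: ReplicationSample) -> List[str]:
    """Return a list of problems; an empty list means the job looks healthy."""
    problems: List[str] = []
    # Healthy replication shows a continuously advancing lastEndTimestamp.
    if current.last_end_timestamp <= previous.last_end_timestamp:
        problems.append("lastEndTimestamp is not advancing")
    # In a steady state, every split metric is expected to be zero.
    split_metrics = (
        ("pendingCount", current.pending_count),
        ("unassignedCount", current.unassigned_count),
        ("pendingRemovalCount", current.pending_removal_count),
    )
    for name, value in split_metrics:
        if value != 0:
            problems.append(f"{name} is {value}, expected 0 in steady state")
    return problems
```

Comparing two samples taken some interval apart avoids treating a single momentary non-zero reading of `lastEndTimestamp` progress as a trend; choose the interval to be longer than the job's scan cycle.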
