Kudu replication pre-stop checklist

Use the following checklist to confirm that the Kudu replication pipeline is fully drained and that no data is lost before you stop the job.

Follow this sequence whenever you stop the replication job, for example for a schema change, planned failover, or maintenance.

  1. Stop all writes to the source table.

    This is an application-level step. You must quiesce or redirect the write path before you proceed.

  2. Verify no writes are in-flight on the source cluster.

    Run the write activity PromQL query for the source table. Proceed only when the query returns 0.

  3. Wait for replication lag to normalize.

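    Run the replication lag query. The lag, expressed in minutes, must settle at approximately job.discoveryIntervalSeconds / 60, that is, about one discovery interval. For example, with job.discoveryIntervalSeconds set to 300, expect a lag of about 5 minutes. This confirms that the final diff scan captured all pending changes.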

  4. Verify the sink cluster has fully drained.

    Ensure the pendingCount, unassignedCount, and pendingRemovalCount metrics are all 0 between discovery cycles.

  5. Verify no write activity exists on the sink cluster.

    Run the write activity PromQL query for the sink table to confirm that the replication job has flushed all buffered operations. Proceed only when the query returns 0.

  6. Stop the job by using a savepoint.

    Run the following command:

    flink stop -p hdfs:///kudu-replication/savepoints/my_table <job-id>

    The command displays the savepoint path upon completion:

    Savepoint completed. Path: hdfs:///kudu-replication/savepoints/my_table/savepoint-<id>
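
The gating logic in steps 2 through 5 can be sketched as a small Python helper. This is a minimal sketch under stated assumptions, not part of the Kudu replication tooling: the function names, the 25% lag tolerance, and the way metric values are passed in are all hypothetical. You would feed it the values returned by your own PromQL queries and only run the flink stop command when it reports that every check passed.

```python
from typing import Mapping

# Metric names taken from step 4 of the checklist.
DRAIN_METRICS = ("pendingCount", "unassignedCount", "pendingRemovalCount")


def expected_lag_minutes(discovery_interval_seconds: float) -> float:
    """Steady-state lag in minutes: job.discoveryIntervalSeconds / 60."""
    return discovery_interval_seconds / 60.0


def lag_is_normal(lag_minutes: float,
                  discovery_interval_seconds: float,
                  tolerance: float = 0.25) -> bool:
    """Return True when the lag is at or below the expected value plus a margin.

    The 25% default tolerance is an illustrative assumption, not a
    documented threshold; tune it for your environment.
    """
    limit = expected_lag_minutes(discovery_interval_seconds) * (1.0 + tolerance)
    return lag_minutes <= limit


def sink_is_drained(metrics: Mapping[str, float]) -> bool:
    """Return True only when every drain metric is present and equal to 0."""
    return all(metrics.get(name) == 0 for name in DRAIN_METRICS)


def safe_to_stop(source_write_rate: float,
                 lag_minutes: float,
                 discovery_interval_seconds: float,
                 sink_metrics: Mapping[str, float],
                 sink_write_rate: float) -> bool:
    """Apply the checks from steps 2 to 5 in order.

    Step 1 (stopping application writes) is manual and is not checked here.
    """
    return (
        source_write_rate == 0                                      # step 2
        and lag_is_normal(lag_minutes, discovery_interval_seconds)  # step 3
        and sink_is_drained(sink_metrics)                           # step 4
        and sink_write_rate == 0                                    # step 5
    )


if __name__ == "__main__":
    # Example values only; replace them with the results of your queries.
    drained = {"pendingCount": 0, "unassignedCount": 0, "pendingRemovalCount": 0}
    print(safe_to_stop(0, 5.0, 300, drained, 0))
```

Treating a missing metric as "not drained" is a deliberate choice in this sketch: an absent value is more likely a scrape or query problem than proof that the pipeline is empty.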