Replicating Kudu tables using Apache Flink
Learn about using the Apache Flink-based replication job to continuously replicate data from a source Kudu table to a sink Kudu table.
The Kudu replication job is an Apache Flink streaming application that continuously copies data from a source Kudu table to a sink Kudu table. It uses the Kudu diff scan capability to detect and replicate only the rows that changed between two points in time. This is more efficient than scanning the full table on every cycle.
Replication mechanism
The replication mechanism follows these steps:
- On the first startup, the job performs a full snapshot scan of the source table and records the snapshot timestamp t0.
- On each subsequent discovery cycle, it performs a diff scan over the interval [t_previous, t_now]. This returns all rows inserted, updated, or deleted in that window.
- Deleted rows are identified using the Kudu IS_DELETED virtual column and are replicated as delete operations on the sink.
- All writes to the sink use upsert-ignore semantics, which makes replication idempotent and compatible with at-least-once Flink checkpointing.
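The interval-advancing logic of the discovery cycle can be sketched as follows. This is a hypothetical in-memory stand-in, not the real Kudu or Flink API: `changes` simulates the source table's commit history, and each cycle replicates only the rows committed in (t_previous, t_now].

```python
def run_cycles(changes, cycle_times):
    """Simulate successive diff-scan discovery cycles.

    `changes` maps commit timestamp -> (key, value); each cycle copies
    only the entries committed in (t_previous, t_now] to the sink,
    mirroring how the job advances its diff-scan interval.
    """
    sink = {}
    t_previous = 0
    for t_now in cycle_times:
        for ts, (key, value) in changes.items():
            if t_previous < ts <= t_now:
                sink[key] = value   # upsert the changed row into the sink
        t_previous = t_now          # this cycle's end becomes the next start
    return sink

# Three commits at timestamps 1, 5, and 9; three cycles ending at 4, 8, 12.
changes = {1: ("a", 1), 5: ("b", 2), 9: ("a", 3)}
print(run_cycles(changes, [4, 8, 12]))  # {'a': 3, 'b': 2}
```

Each cycle picks up exactly the window of changes since the previous one, so no row is scanned twice under normal operation.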
Flink checkpoints occur regularly to enable fault-tolerant recovery. If a restart occurs from a checkpoint, any splits that were in-flight at the time of the checkpoint are reprocessed. Because sink writes are idempotent, reprocessing is safe.
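To illustrate why reprocessing after a checkpoint restore is safe, here is a minimal sketch of the sink-write semantics, with in-memory dicts standing in for Kudu tables and tuples standing in for diff-scan rows (this is an assumption-laden illustration, not the real client API):

```python
def apply_diff(sink, diff_rows):
    """Apply one diff-scan batch to the sink.

    Each row is (key, value, is_deleted). Upserts overwrite unconditionally
    and deletes ignore missing rows, so replaying the same batch after a
    restart leaves the sink unchanged (idempotent, at-least-once safe).
    """
    for key, value, is_deleted in diff_rows:
        if is_deleted:
            sink.pop(key, None)   # delete-ignore: no error if already gone
        else:
            sink[key] = value     # upsert: insert or overwrite

sink = {}
diff = [(1, "a", False), (2, "b", False), (1, None, True)]
apply_diff(sink, diff)
apply_diff(sink, diff)  # replay the same in-flight split after a restart
print(sink)  # {2: 'b'} both times: reprocessing is a no-op
```

Because every operation is a blind overwrite or an ignore-if-absent delete, applying a batch twice produces the same sink state as applying it once.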
Known limitations
- Schema changes: These are not applied automatically. You must follow the manual schema change procedure strictly.
- Unsupported types: The ARRAY column type is not currently supported. You cannot replicate tables containing array columns.
- Table renames: Renaming a replicated table requires a specific coordinated procedure.
