Reverse Checkpointing

The reverse checkpointing feature in Streams Replication Manager enables the tracking and replication of consumer offsets from a target cluster back to a source cluster. This ensures that progress made by consumer groups on a backup cluster is preserved and translated back to the primary cluster during a failback scenario.

Failback in standard replication scenarios

In standard replication scenarios, checkpointing replicates committed consumer group offsets from an upstream (source) topic to a downstream (replica) topic. This allows consumers to fail over to a backup cluster and resume consumption without reprocessing the topic from the beginning. However, without reverse checkpointing, the downstream-to-upstream direction is not supported. When a consumer group fails back to the original source cluster after processing data on the backup, the offsets committed on the backup are not translated back to the source.

This results in one of the following scenarios upon failback:

Missing offsets:
If no offset is recorded for the consumer group in the source cluster, the consumer resets to the earliest or latest offset, potentially leading to massive data duplication or skipped data. This can occur when the consumer group only started consuming from the backup cluster after the failover and never consumed from the source cluster.
Obsolete offsets:
The consumer resumes from an old offset recorded before the failover, which means that data that was consumed on the backup cluster is consumed again, leading to data duplication.
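The cost of the obsolete-offset scenario can be illustrated with a small sketch. The function and the offsets below are purely illustrative, not part of any Streams Replication Manager API:

```python
# Hypothetical sketch: quantify re-read records when a consumer group fails
# back with an obsolete offset on the source cluster.

def duplicated_records(last_source_offset, consumed_through_equivalent):
    """Records consumed twice when the group resumes from an offset that is
    older than the source-offset equivalent of its position on the backup."""
    return max(0, consumed_through_equivalent - last_source_offset)

# The group committed offset 1000 on the source before failover, then
# consumed through the backup equivalent of source offset 5000.
print(duplicated_records(1000, 5000))  # 4000 records consumed twice
```

Reverse checkpointing shrinks this gap by translating the position reached on the backup into a source-cluster offset, so the group resumes near 5000 rather than 1000.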

Reverse checkpointing logic

Reverse checkpointing addresses the failback data duplication issue through offset translation in the downstream-to-upstream direction. To perform this translation, Streams Replication Manager requires bidirectional replication to be enabled. In case of failback, the replication flow from the backup to the source can then use the offset-syncs generated by the original source-to-backup flow as well as those generated by the reverse flow. This allows Streams Replication Manager to map the offsets of the replica topic back to the equivalent offsets of the source topic, thereby minimizing message duplication.
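Bidirectional replication with checkpoint emission can be sketched with a MirrorMaker 2-style properties file. The cluster aliases and broker addresses below are illustrative, and the exact property set needed to enable reverse checkpointing may differ; consult the Streams Replication Manager configuration reference for the authoritative names:

```properties
clusters = primary, backup
primary.bootstrap.servers = primary-broker:9092
backup.bootstrap.servers = backup-broker:9092

# Replicate in both directions so offset-syncs exist for both flows
primary->backup.enabled = true
backup->primary.enabled = true

# Emit checkpoints so committed consumer offsets can be translated
primary->backup.emit.checkpoints.enabled = true
backup->primary.emit.checkpoints.enabled = true
```

With both flows active, the backup-to-primary flow has the offset-sync information it needs to translate offsets committed on the backup back into primary-cluster offsets during failback.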

Limitations

Reverse checkpointing aims to minimize the volume of duplicate messages upon failback. However, it does not eliminate duplicates completely, which means that consumers need to be prepared to handle duplicates, even if you are using exactly-once semantics in Kafka and Streams Replication Manager. Also, the exact order of messages is not guaranteed to be preserved in every failover and failback scenario.
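Because some duplicates can survive failback, downstream processing should be idempotent. The following is a minimal sketch, assuming each record carries a stable unique key; the function names are illustrative, not a Kafka client API, and a production deduplication store must be durable rather than an in-memory set:

```python
# Sketch of duplicate-tolerant processing around failover/failback.

def process_once(records, seen_keys, apply):
    """Apply each record at most once, skipping keys already processed.
    seen_keys stands in for a durable store (e.g. a database keyed by
    record ID); an in-memory set is used here only for illustration."""
    for key, value in records:
        if key in seen_keys:
            continue  # duplicate redelivered after failback; skip it
        apply(value)
        seen_keys.add(key)

out = []
seen = set()
process_once([("a", 1), ("b", 2)], seen, out.append)
# Failback redelivers ("b", 2) alongside new data ("c", 3)
process_once([("b", 2), ("c", 3)], seen, out.append)
print(out)  # [1, 2, 3]
```

Note that this guards only against duplicates; it does not restore ordering, which, as stated above, is not guaranteed across every failover and failback.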