Performing a failover or failback

Learn about failover and failback operations that you can perform between two Kafka clusters that have data replication enabled. Performing a failback or failover operation enables you to migrate consumer and producer applications between Kafka clusters. These operations are typically performed after a disaster event or in migration scenarios.

Figure 1. Failover/failback setup

The producer and consumer applications both connect to the source cluster, while a Kafka Connect cluster is configured to replicate the business topics and synchronize the group offsets into the target cluster. Note that the business_topic in the target cluster is not created by replication. Instead you create this topic in preparation for the failover or failback scenario.
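
For illustration, the replication in this setup could be driven by the MirrorMaker 2 connectors running on the Kafka Connect cluster. The following sketch shows the kind of configuration that replicates business_topic and synchronizes group offsets. The cluster aliases, bootstrap servers, and group filter are placeholder values, and the properties would typically be submitted to the Kafka Connect REST API.

  import java.util.Map;

  public class ReplicationConnectorConfigs {
      public static void main(String[] args) {
          // MirrorSourceConnector: copies business_topic from the source cluster into
          // the target cluster (as source.business_topic with the DefaultReplicationPolicy).
          Map<String, String> sourceConnector = Map.of(
              "connector.class", "org.apache.kafka.connect.mirror.MirrorSourceConnector",
              "source.cluster.alias", "source",                          // placeholder alias
              "target.cluster.alias", "target",                          // placeholder alias
              "source.cluster.bootstrap.servers", "source-kafka:9092",   // placeholder address
              "target.cluster.bootstrap.servers", "target-kafka:9092",   // placeholder address
              "topics", "business_topic");

          // MirrorCheckpointConnector: emits checkpoints and synchronizes the consumer
          // group offsets into the target cluster.
          Map<String, String> checkpointConnector = Map.of(
              "connector.class", "org.apache.kafka.connect.mirror.MirrorCheckpointConnector",
              "source.cluster.alias", "source",
              "target.cluster.alias", "target",
              "source.cluster.bootstrap.servers", "source-kafka:9092",
              "target.cluster.bootstrap.servers", "target-kafka:9092",
              "groups", ".*",                                            // placeholder group filter
              "sync.group.offsets.enabled", "true");

          System.out.println(sourceConnector);
          System.out.println(checkpointConnector);
      }
  }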

There are multiple types of failover and failback operations that you can carry out. Which one you perform depends on your scenario and use case. The failover and failback types are as follows.

Continuous and controlled failover
A continuous and controlled failover is carried out when all applications and services are working as expected, but you want to move workloads from one cluster to another. This type of failover is continuous because applications are moved to the target cluster without a cutoff. This failover can be performed rapidly and comes with minimal service disruptions.

This failover type works with DefaultReplicationPolicy only.

Controlled failover with a cutoff
A controlled failover with a cutoff is carried out when all applications and services are working as expected. The cutoff means that producers are stopped for the duration of the failover and consumer traffic is exhausted in the source cluster.

Compared to a continuous failover, this failover is more complex, but does not rely on group offset syncing, and can also guarantee message ordering for consumers even across the failover.

This failover type works with both the DefaultReplicationPolicy and IdentityReplicationPolicy.

Failover on disaster
A failover on disaster is carried out when you encounter a disaster scenario where your source cluster becomes unavailable. A failover on disaster simply consists of reconfiguring and restarting your client applications to use the target Kafka cluster.

Controlled failback
A controlled failback is the same as a failover operation but in reverse order. That is, you move clients back to their original cluster. A failback operation assumes that you already performed a failover operation.

Performing a continuous and controlled failover

Learn how to perform a continuous and controlled failover between Kafka clusters that have data replication enabled.

A continuous and controlled failover is carried out when all applications and services are working as expected. That is, there is no disaster scenario. Instead you make an executive decision to move your workload from the source cluster to the target cluster so that you can stop the source cluster, either temporarily or permanently, without disrupting applications.

The failover is continuous because applications can be moved to the target cluster without a strict cutoff. Because of this, the failover can be performed rapidly with minimal service disruptions.

Throughout this process, replication of Kafka data is not stopped, ensuring that no data is lost.

Ensure that you are familiar with the process of checking replication state. See Checking the state of data replication.
  1. Fail over consumers.
    1. Gracefully stop consumers.
      This allows the consumers to commit the offsets of their latest state to the source Kafka cluster.
    2. Wait for the replication to successfully synchronize the latest offsets.
      Calculate the wait time based on the intervals configured in the emit.checkpoints.interval.seconds (default 60 seconds) and sync.group.offsets.interval.seconds (default 60 seconds) properties of the MirrorCheckpointConnector. The wait time is the sum of these properties multiplied by two, which is 240 seconds with the default values.
      wait time = 2 * (emit.checkpoints.interval.seconds + sync.group.offsets.interval.seconds)
    3. Configure consumers to connect to the target cluster and to consume from both the replicated (prefixed) topic and the active (prefixless) topic in the target cluster.
      Consumers might not have processed all messages from the source cluster yet. To pick up the remaining data, they need to consume from the prefixed replica topics as well as from the active (prefixless) topic, so that they also see the new data produced to the target cluster once the producers are failed over. See the consumer sketch after this procedure.
    4. Start consumers.
  2. Fail over producers.
    1. Gracefully stop producers.
    2. Configure producers to connect to the target cluster.
      Producers can safely produce to the exact same topics without any name changes, because the replicated data is stored in a prefixed topic.
    3. Start producers.
  3. Wait for the replication to finish replicating all data that was produced to the source cluster.
    At this point, it is still possible that not all records are migrated to the target cluster. Check the state of the replication to ensure that all records are fully replicated.
  4. Stop the source cluster.
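
The following is a minimal sketch of the consumer reconfiguration in step 1, assuming the DefaultReplicationPolicy, a source cluster alias of source, and placeholder addresses and group names. Only the bootstrap servers and the subscription change compared to the original consumer; the group ID is kept so that the synchronized group offsets are picked up.

  import java.time.Duration;
  import java.util.List;
  import java.util.Properties;
  import org.apache.kafka.clients.consumer.ConsumerRecord;
  import org.apache.kafka.clients.consumer.ConsumerRecords;
  import org.apache.kafka.clients.consumer.KafkaConsumer;

  public class FailedOverConsumer {
      public static void main(String[] args) {
          Properties props = new Properties();
          props.put("bootstrap.servers", "target-kafka:9092");   // placeholder target cluster
          props.put("group.id", "my-consumer-group");             // keep the original group ID
          props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
          props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

          try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
              // Subscribe to both the replicated (prefixed) topic, which holds the remaining
              // data from the source cluster, and the active (prefixless) topic, which
              // receives new data once the producers are failed over.
              consumer.subscribe(List.of("source.business_topic", "business_topic"));
              while (true) {
                  ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                  for (ConsumerRecord<String, String> record : records) {
                      System.out.printf("%s-%d@%d: %s%n",
                          record.topic(), record.partition(), record.offset(), record.value());
                  }
              }
          }
      }
  }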

Performing a controlled failover with a cutoff

Learn how to perform a controlled failover with a cutoff between Kafka clusters that have data replication enabled.

A controlled failover with a cutoff is carried out when all applications and services are working as expected. That is, there is no disaster scenario. Instead you make an executive decision to stop the source cluster, either temporarily or permanently, and move your workload from the source to the target cluster.

The failover has a cutoff because producers are stopped for the duration of the failover. Additionally, all consumer traffic is exhausted in the source cluster. This results in a longer disruption for client applications.

A controlled failover with a cutoff is a complex process, but does not rely on group offset syncing, and can also guarantee message ordering for consumers even across the failover.
Ensure that you are familiar with the process of checking replication state. See Checking the state of data replication.
  1. Gracefully stop producers.
    This stops the ingress traffic, allowing all consumers to fully read all data.
  2. Monitor the consumers and the replication, and wait until all data is read.
    • To monitor the consumer applications, use the kafka-consumer-groups.sh Kafka tool with the --describe option. Wait until the lag becomes 0. For a programmatic alternative, see the lag check sketch after this procedure.
    • To check replication state, compare source offsets and the offsets of the MirrorSourceConnector. Wait until replication fully catches up with business data.
  3. Gracefully stop consumers.
  4. Gracefully stop replication.
  5. If using the IdentityReplicationPolicy: Reset the offsets of all consumer groups to the latest offset in the target cluster. See the offset reset sketch after this procedure.

    This ensures that old data is not consumed after the failover. Steps 2 and 3 already ensure that all old data has been successfully consumed.

  6. Configure the producers to connect to the target cluster.
    Producers can safely produce to the exact same topics without any name changes.
  7. Start producers.
  8. Configure consumers to connect to the target cluster.
    Consumers can safely consume from the exact same topics without any name changes.
    • If the DefaultReplicationPolicy and topic prefixing are used, the replicated data is separated into the prefixed topic. This only affects new consumers, as existing consumers were already able to fully consume the old data from the source cluster.
    • If the IdentityReplicationPolicy is used, all old data was already written into the topic, since Steps 2 and 3 ensure that no more old data arrives in the topic. Only newly produced data is written into it after the failover.
  9. Start consumers.
  10. Stop the source cluster.
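
The lag check in step 2 can also be performed programmatically with the Kafka Admin API. The following sketch, which uses a placeholder group name and bootstrap address, compares the committed group offsets with the latest offsets on the source cluster and prints the remaining lag per partition.

  import java.util.HashMap;
  import java.util.Map;
  import java.util.Properties;
  import org.apache.kafka.clients.admin.Admin;
  import org.apache.kafka.clients.admin.ListOffsetsResult.ListOffsetsResultInfo;
  import org.apache.kafka.clients.admin.OffsetSpec;
  import org.apache.kafka.clients.consumer.OffsetAndMetadata;
  import org.apache.kafka.common.TopicPartition;

  public class ConsumerLagCheck {
      public static void main(String[] args) throws Exception {
          Properties props = new Properties();
          props.put("bootstrap.servers", "source-kafka:9092");   // placeholder source cluster

          try (Admin admin = Admin.create(props)) {
              // Committed offsets of the consumer group (placeholder group name).
              Map<TopicPartition, OffsetAndMetadata> committed =
                  admin.listConsumerGroupOffsets("my-consumer-group")
                       .partitionsToOffsetAndMetadata().get();

              // Latest offsets of the same partitions.
              Map<TopicPartition, OffsetSpec> request = new HashMap<>();
              committed.keySet().forEach(tp -> request.put(tp, OffsetSpec.latest()));
              Map<TopicPartition, ListOffsetsResultInfo> latest =
                  admin.listOffsets(request).all().get();

              // Lag per partition; wait until every value reaches 0.
              committed.forEach((tp, offset) -> {
                  long lag = latest.get(tp).offset() - offset.offset();
                  System.out.printf("%s lag=%d%n", tp, lag);
              });
          }
      }
  }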
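
For step 5, the group offsets can be reset with the kafka-consumer-groups.sh tool (--reset-offsets --to-latest) or programmatically. The following sketch, again with placeholder names and assuming a recent Kafka clients version, moves one consumer group to the latest offsets of business_topic in the target cluster; it assumes the group is not active while the reset runs.

  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;
  import java.util.Properties;
  import org.apache.kafka.clients.admin.Admin;
  import org.apache.kafka.clients.admin.OffsetSpec;
  import org.apache.kafka.clients.consumer.OffsetAndMetadata;
  import org.apache.kafka.common.TopicPartition;

  public class ResetGroupToLatest {
      public static void main(String[] args) throws Exception {
          Properties props = new Properties();
          props.put("bootstrap.servers", "target-kafka:9092");   // placeholder target cluster

          try (Admin admin = Admin.create(props)) {
              // Find the number of partitions of the business topic.
              int partitions = admin.describeTopics(List.of("business_topic"))
                  .allTopicNames().get().get("business_topic").partitions().size();

              // Look up the latest offset of every partition.
              Map<TopicPartition, OffsetSpec> request = new HashMap<>();
              for (int p = 0; p < partitions; p++) {
                  request.put(new TopicPartition("business_topic", p), OffsetSpec.latest());
              }
              Map<TopicPartition, OffsetAndMetadata> newOffsets = new HashMap<>();
              admin.listOffsets(request).all().get().forEach((tp, info) ->
                  newOffsets.put(tp, new OffsetAndMetadata(info.offset())));

              // Move the consumer group (placeholder name) to the latest offsets.
              admin.alterConsumerGroupOffsets("my-consumer-group", newOffsets).all().get();
          }
      }
  }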

Performing a failover on disaster

Learn how to perform a failover operation in a disaster scenario between Kafka clusters that have data replication enabled.

In a disaster scenario where your source cluster becomes unavailable, you cannot perform a failover in a controlled manner. In a case like this, a failover operation simply involves reconfiguring and restarting all client applications to use the target Kafka cluster.

In a failover on disaster, the data and the group offsets replicated up until the failure can be used to continue processing.

In a disaster scenario with an uncontrolled stop and crash event, some messages that were successfully accepted in the source cluster might not be replicated to the target cluster. This means that some messages will not be accessible to consumers, even though they were successfully produced to the source cluster. This is because replication is asynchronous and can lag behind the source data. This is also true when exactly-once semantics (EOS) is enabled for data replication.
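
As an illustration, failing over a producer in this scenario can be as small as the following sketch, where only the bootstrap servers change and the topic name stays the same (names and addresses are placeholders). Consumers are reconfigured the same way, and with the DefaultReplicationPolicy they additionally subscribe to the prefixed replica topic to read the data replicated before the failure.

  import java.util.Properties;
  import org.apache.kafka.clients.producer.KafkaProducer;
  import org.apache.kafka.clients.producer.ProducerRecord;

  public class FailedOverProducer {
      public static void main(String[] args) {
          Properties props = new Properties();
          // The only change compared to the original producer configuration:
          // point the client at the target cluster (placeholder address).
          props.put("bootstrap.servers", "target-kafka:9092");
          props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
          props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

          try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
              // The topic name is unchanged; new data goes into the prefixless topic
              // in the target cluster.
              producer.send(new ProducerRecord<>("business_topic", "key", "value"));
              producer.flush();
          }
      }
  }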

Performing a controlled failback

Learn about performing failback operations between Kafka clusters that have data replication enabled.

A controlled failback operation is the same as a failover operation, but in reverse order. That is, you move your clients back to their original Kafka cluster. Typically this means moving from the target cluster of the replication to the source cluster of the replication some time after a failover operation was performed.

To complete a failback operation, follow the steps for any of the failover operations, but in reverse order. However, take note of the following caveats.

  • A failback assumes bidirectional replication, as data produced into the target Kafka cluster is not present in the source cluster and needs to be replicated back.
  • You cannot perform a failback operation if the IdentityReplicationPolicy is in use.

    This is because the IdentityReplicationPolicy does not allow bidirectional replication over the same topics, as the topic names are not altered during replication. A bidirectional replication setup with the IdentityReplicationPolicy would result in a replication loop where topics are infinitely replicated between the source and target clusters. If using the IdentityReplicationPolicy, after a failover you must stop and remove your previous replication setup and reconfigure it in the reverse direction before you can fail back.

  • The MirrorCheckpointConnector and group offset synchronization only function in the context of a single replication flow. Mapping offsets back to the original topic is not supported.

    This means that any progress made by consumers in the target Kafka cluster over the replicated (prefixed) topics, that is, the old data, is lost. There is a high likelihood that consumers will reprocess old data after the failback. You can avoid this scenario if the initial failover operation that you carry out is a controlled failover with a cutoff. A failover with a cutoff guarantees that all old data was already consumed.