Performing a failover or failback
Learn about failover and failback operations that you can perform between two Kafka clusters that have data replication enabled. Performing a failover or failback operation enables you to migrate consumer and producer applications between Kafka clusters. These operations are typically performed after a disaster event or in migration scenarios.
The producer and consumer applications both connect to the source cluster, while a Kafka Connect cluster is configured to replicate the business topics and synchronize the group offsets into the target cluster. Note that the business_topic in the target cluster is not created by replication. Instead, you create this topic in preparation for the failover or failback scenario.
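For example, you can pre-create the topic in the target cluster with the Kafka AdminClient. The following is a minimal sketch; the target:9092 address, partition count, and replication factor are assumptions and should match your environment and the configuration of the source topic.

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateBusinessTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Hypothetical target cluster address; replace with your own.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "target:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // Partition count and replication factor should mirror the source topic.
            NewTopic topic = new NewTopic("business_topic", 3, (short) 3);
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```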
There are multiple types of failover and failback operations that you can carry out. Which one you perform depends on your scenario and use case. The failover and failback types are as follows.
- Continuous and controlled failover
- A continuous and controlled failover is carried out when all applications and services are working as expected, but you want to move workloads from one cluster to another. This type of failover is continuous because applications are moved to the target continuously, without a cutoff. This failover can be performed rapidly and comes with minimal service disruptions.
This failover type works with DefaultReplicationPolicy only.
- Controlled failover with a cutoff
- A controlled failover with a cutoff is carried out when all applications and services are working as expected. The cutoff means that producers are stopped for the duration of the failover and consumers fully drain the remaining messages from the source cluster.
Compared to a continuous failover, this failover is more complex, but it does not rely on group offset syncing and can guarantee message ordering for consumers even across the failover.
This failover type works with both the DefaultReplicationPolicy and IdentityReplicationPolicy.
- Failover on disaster
- A failover on disaster is carried out when you encounter a disaster scenario where your source cluster becomes unavailable. In this case, the failover simply consists of reconfiguring and restarting your client applications to use the target Kafka cluster.
- Controlled failback
- A controlled failback is the same as a failover operation but performed in reverse order. That is, you move clients back to their original cluster. A failback operation assumes that you already performed a failover operation.
Performing a continuous and controlled failover
Learn how to perform a continuous and controlled failover between Kafka clusters that have data replication enabled.
A continuous and controlled failover is carried out when all applications and services are working as expected. That is, there is no disaster scenario. Instead, you make an executive decision to move your workload from the source cluster to the target cluster so that you can stop the source cluster, either temporarily or permanently, without disrupting applications.
The failover is continuous because applications can be continuously moved to the target cluster without a strict cutoff. Because of this, the failover can be performed rapidly with minimal service disruptions.
Throughout this process, replication of Kafka data is not stopped, ensuring that no data is lost.
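The following consumer sketch illustrates the move: only the bootstrap address changes, while the unchanged group.id lets the consumer resume from the group offsets that replication synchronized into the target cluster. The target:9092 address, group name, and the source cluster alias (source) are assumptions; with DefaultReplicationPolicy, data replicated from the source is available under the prefixed topic name.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class FailoverConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Point the consumer at the target cluster. The group.id is unchanged,
        // so the consumer resumes from the synchronized group offsets.
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "target:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "business-consumer-group");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // With DefaultReplicationPolicy, replicated data lives in a topic
            // prefixed with the source cluster alias ("source" is assumed here).
            consumer.subscribe(List.of("business_topic", "source.business_topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("%s-%d@%d: %s%n",
                            record.topic(), record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```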
Performing a controlled failover with a cutoff
Learn how to perform a controlled failover with a cutoff between Kafka clusters that have data replication enabled.
A controlled failover with a cutoff is carried out when all applications and services are working as expected. That is, there is no disaster scenario. Instead, you make an executive decision to stop the source cluster, either temporarily or permanently, and move your workload from the source to the target cluster.
The failover has a cutoff because producers are stopped for the duration of the failover. Additionally, consumers fully drain the remaining messages from the source cluster before they are moved. This results in a longer disruption in client applications.
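Before completing the cutoff, you can verify that consumers have drained the source cluster by comparing the group's committed offsets to the log-end offsets. The following is a minimal sketch; the source:9092 address and group name are assumptions.

```java
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class VerifyDrained {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "source:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // Committed offsets of the consumer group on the source cluster.
            Map<TopicPartition, OffsetAndMetadata> committed =
                    admin.listConsumerGroupOffsets("business-consumer-group")
                            .partitionsToOffsetAndMetadata().get();
            // Log-end offsets of the same partitions.
            Map<TopicPartition, OffsetSpec> request = committed.keySet().stream()
                    .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest()));
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> ends =
                    admin.listOffsets(request).all().get();
            committed.forEach((tp, offset) -> {
                long lag = ends.get(tp).offset() - offset.offset();
                System.out.printf("%s lag=%d%n", tp, lag);
            });
        }
    }
}
```

When every partition reports a lag of zero, the consumers are fully caught up and can be moved to the target cluster.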
Performing a failover on disaster
Learn how to perform a failover operation in a disaster scenario between Kafka clusters that have data replication enabled.
In a disaster scenario where your source cluster becomes unavailable, you cannot perform a failover in a controlled manner. In a case like this, a failover operation simply involves reconfiguring and restarting all client applications to use the target Kafka cluster.
In a failover on disaster, the data and the group offsets replicated up until the failure can be used to continue processing.
In a disaster scenario with an uncontrolled stop or crash event, some messages that were successfully accepted in the source cluster might not be replicated to the target cluster. This means that some messages are not accessible to consumers, even though they were successfully produced into the source cluster. This is because replication is asynchronous and can lag behind the source data. This is also true when exactly-once semantics (EOS) is enabled for data replication.
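For most client applications, the reconfiguration amounts to changing the bootstrap address. The following producer sketch assumes a target:9092 address; the producer keeps writing to the original topic name, now on the target cluster.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

public class FailoverProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // The only required change for the producer is the bootstrap address.
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "target:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("business_topic", "key", "value"));
            producer.flush();
        }
    }
}
```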
Performing a controlled failback
Learn how to perform a controlled failback between Kafka clusters that have data replication enabled.
To complete a failback operation, follow the steps for any of the failover operations, but in reverse order. However, take note of the following caveats.
- A failback assumes bidirectional replication: data produced into the target Kafka cluster is not present in the source cluster, so it must be replicated back.
- You cannot perform a failback operation if the IdentityReplicationPolicy is in use. This is because the IdentityReplicationPolicy does not allow bidirectional replication over the same topics, as topic names are not altered during replication. A bidirectional replication setup with IdentityReplicationPolicy would result in a replication loop where topics are infinitely replicated between the source and target clusters. If you use the IdentityReplicationPolicy, after a failover you must stop and remove your previous replication setup and reconfigure it in the reverse direction before you can fail back.
- The MirrorCheckpointConnector and group offset synchronization only function in the context of a single replication flow. Mapping offsets back to the original topic is not supported. This means that any progress made by consumers in the target Kafka cluster over the replicated (prefixed) topics, that is, the old data, is lost. There is a high likelihood that consumers will reprocess old data after the failback. You can avoid a scenario like this if the initial failover operation that you carry out is a controlled failover with a cutoff, because a failover with a cutoff guarantees that all old data was already consumed. If a cutoff was not possible, you can limit reprocessing by manually restoring group offsets on the source cluster, as shown in the sketch after this list.
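The following is a minimal sketch of restoring manually recorded offsets; the source:9092 address, group name, and offset value are placeholders, and the approach assumes you captured the group's committed offsets on the source cluster at failover time.

```java
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class RestoreGroupOffsets {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "source:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // Offsets captured on the source cluster at failover time
            // (partition 0 of business_topic at offset 42000 is a placeholder).
            Map<TopicPartition, OffsetAndMetadata> recorded = Map.of(
                    new TopicPartition("business_topic", 0),
                    new OffsetAndMetadata(42000L));
            // Overwrite the group's committed offsets before consumers restart.
            admin.alterConsumerGroupOffsets("business-consumer-group", recorded)
                    .all().get();
        }
    }
}
```

Note that the consumer group must be inactive, with no running consumers, when its offsets are altered.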