Kafka disaster recovery
Learn about the cluster architectures you can use when designing highly available and resilient Kafka clusters.
Kafka has built-in replication for its topics and allows users to fine-tune the data durability configurations to achieve the desired redundancy guarantees. When designing a deployment where data is replicated over multiple Data Centers (DCs) and has disaster recovery (DR) options, you need to carefully analyze your data durability and business continuity requirements. The following sections introduce the architectural options that you can choose from when building a resilient multi-DC Kafka deployment. Additionally, guidance is provided to help you choose the right architecture for your use case.
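For reference, the following is a minimal sketch of what tuning these durability configurations can look like: a topic created with a replication factor of 3 and min.insync.replicas set to 2, written to by a producer configured with acks=all so that a write is only acknowledged once all in-sync replicas have received it. The broker address, topic name, partition count, and replica counts are illustrative assumptions, not recommended values.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class DurabilityConfigSketch {
    public static void main(String[] args) throws Exception {
        String bootstrap = "broker-1:9092"; // hypothetical broker address

        Properties adminProps = new Properties();
        adminProps.put("bootstrap.servers", bootstrap);
        try (AdminClient admin = AdminClient.create(adminProps)) {
            // 3 replicas per partition; at least 2 of them must be in sync
            // for a write with acks=all to be accepted.
            NewTopic topic = new NewTopic("payments", 6, (short) 3)
                    .configs(Map.of("min.insync.replicas", "2"));
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }

        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", bootstrap);
        producerProps.put("key.serializer", StringSerializer.class.getName());
        producerProps.put("value.serializer", StringSerializer.class.getName());
        // acks=all: the broker acknowledges only after all in-sync replicas
        // have the record.
        producerProps.put("acks", "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            producer.send(new ProducerRecord<>("payments", "order-42", "created")).get();
        }
    }
}
```

In a stretch cluster, these settings combined with rack-aware replica placement (each DC configured as a rack) are what ensure that an acknowledged write already exists in more than one DC.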
- Stretch clusters
- A stretch cluster is a single logical Kafka cluster deployed across multiple DCs or other independent physical infrastructures such as cloud availability zones. For more information, see Stretch clusters or Kafka Stretch cluster reference architecture.
- Replication using SRM
- Streams Replication Manager (SRM) is an enterprise-grade replication solution that enables fault-tolerant, scalable, and robust cross-cluster Kafka topic replication. A deployment that uses SRM for disaster recovery consists of multiple Kafka clusters, with SRM acting as a bridge between the clusters and replicating the data. For more information, see Streams Replication Manager Overview.
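To make the bridge role concrete: SRM names replicated topics by prefixing them with the alias of the source cluster, so a topic named payments replicated from a cluster aliased primary appears as primary.payments on the target cluster. The following minimal consumer sketch reads such a replica on the target cluster; the cluster alias, broker address, topic, and group names are assumptions made for illustration.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class RemoteTopicConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "secondary-broker-1:9092"); // hypothetical target cluster
        props.put("group.id", "dr-reader");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // "primary.payments" is the replica of the "payments" topic that
            // SRM copied over from the cluster aliased "primary".
            consumer.subscribe(Collections.singletonList("primary.payments"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("%s -> %s%n", record.key(), record.value());
                }
            }
        }
    }
}
```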
Use a stretch cluster if
- You have a zero Recovery Point Objective (RPO=0) requirement
- RPO=0 means that data must be replicated to multiple DCs before a write is deemed successful (acknowledged by the broker). This requires synchronous replication between DCs, which only a stretch cluster can provide.
However, consider the following: if you can replay data from upstream sources, such as databases, implementing that recovery function once may be easier than operating and maintaining a stretch cluster.
- You need strict message ordering guarantees across DCs
- Strict message ordering per partition can only be achieved with a single topic spanning multiple DCs. The SRM architecture involves multiple topics (one original and many replicas), which cannot guarantee strict ordering during failover or failback operations.
However, consider the following: if your data has an attribute that can be reliably used for ordering, implementing a reordering step in your downstream processing (sketched after this list) might be easier and more cost-effective than operating and maintaining a stretch cluster.
- You need automatic failover for your clients when a DC goes down
- The Kafka protocol has built-in cluster discovery and leadership migration on failures, so fully automatic failover can be achieved with a stretch cluster. The SRM-based architecture requires a manual step in the failover process, which makes it unsuitable for this use case.
- You need exactly-once transactional processing
- Exactly-once processing in Kafka is currently only supported within a single cluster. Because a stretch cluster is a single logical cluster, it is the only architecture in which exactly-once processing can span DCs (a transactional producer sketch follows this list).
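The reordering step mentioned above can be illustrated with a short sketch. This version simply sorts each polled batch by the record timestamp before processing; any attribute of the payload that reliably reflects order would work the same way. A production version would buffer across batches within a time window, which is omitted here.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;

public final class ReorderStep {
    // Collects a polled batch and sorts it by the record timestamp. If the
    // payload carries a more reliable ordering attribute (for example a
    // sequence number), extract and compare that instead.
    public static List<ConsumerRecord<String, String>> reorder(
            ConsumerRecords<String, String> batch) {
        List<ConsumerRecord<String, String>> sorted = new ArrayList<>();
        batch.forEach(sorted::add);
        sorted.sort(Comparator.comparingLong(ConsumerRecord::timestamp));
        return sorted;
    }
}
```

For the exactly-once point, the sketch below shows a minimal transactional producer. The transaction coordinator and transaction markers live inside the cluster the producer is connected to, which is why exactly-once semantics cannot extend to SRM replicas in another cluster. The transactional.id, topic name, and broker address are illustrative assumptions.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.KafkaException;
import org.apache.kafka.common.serialization.StringSerializer;

public class TransactionalProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker-1:9092"); // hypothetical address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        // The transactional id ties this producer's transaction state to a
        // single cluster.
        props.put("transactional.id", "payments-processor-1");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();
            producer.beginTransaction();
            try {
                producer.send(new ProducerRecord<>("payments", "order-42", "created"));
                producer.commitTransaction(); // all-or-nothing within this cluster
            } catch (KafkaException e) {
                producer.abortTransaction();
                throw e;
            }
        }
    }
}
```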
Use SRM with multiple clusters if
- You need high availability during cluster maintenance
- When a Kafka cluster needs to be stopped for maintenance, clients can fail over to a backup cluster. The stretch cluster solution does not support this, because its architecture consists of a single Kafka cluster. A client-side failover sketch is provided after this list.
- You need replication between clusters that have high latency (replication across multiple regions)
- The stretch cluster architecture is sensitive to high latency, making it unsuitable for multi-region deployments. The asynchronous replication provided by SRM works well even in high-latency environments.
- You need high throughput replication between DCs
- The throughput of the stretch cluster architecture degrades rapidly with increasing latency, because synchronous replication makes every acknowledged write wait for at least one cross-DC round trip. SRM, on the other hand, can provide better replication throughput even in high-latency environments.
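The following sketch illustrates the manual, client-side failover mentioned above, under assumed names: a primary cluster aliased primary, a backup cluster that SRM replicates to, and a topic named payments. Subscribing with a pattern that matches both the original topic and its SRM replica lets the same consumer code run against either cluster, so failing over reduces to changing the bootstrap address. Note that a real failover typically also translates consumer group offsets between the clusters, which this sketch omits.

```java
import java.time.Duration;
import java.util.Properties;
import java.util.regex.Pattern;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class FailoverConsumerSketch {
    // Builds a consumer against the given cluster. During a failover the only
    // change is the bootstrap address; the subscription stays the same.
    public static KafkaConsumer<String, String> connect(String bootstrap) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrap);
        props.put("group.id", "payments-app");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        // Matches "payments" on the primary cluster and "primary.payments"
        // (the SRM replica) on the backup cluster.
        consumer.subscribe(Pattern.compile("(primary\\.)?payments"));
        return consumer;
    }

    public static void main(String[] args) {
        // Normally points at the primary; during maintenance, point at the backup.
        try (KafkaConsumer<String, String> consumer =
                connect("backup-broker-1:9092")) { // hypothetical address
            consumer.poll(Duration.ofSeconds(1)).forEach(
                    record -> System.out.printf("%s -> %s%n", record.key(), record.value()));
        }
    }
}
```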