Streams Replication Manager recommended deployment architecture

Learn about pull mode, which is the Cloudera-recommended deployment architecture for Streams Replication Manager.

While Streams Replication Manager can be deployed in many different ways, the Cloudera-recommended setup is pull mode. Pull mode refers to a Streams Replication Manager deployment where data is replicated by pulling it from remote source clusters, rather than by pushing it into remote target clusters. A Streams Replication Manager deployment that is in pull mode meets the following conditions:

An instance of the Streams Replication Manager service (Driver and Service roles) is deployed on all clusters that host a target Kafka cluster. In other words, each target Kafka cluster in the deployment has a co-located Streams Replication Manager service.
All instances of the Streams Replication Manager service use the same replication policy.
Streams Replication Manager Driver roles only execute replications that target their co-located Kafka cluster.
Streams Replication Manager Service roles only target their co-located Kafka cluster.
If using the IdentityReplicationPolicy in a bidirectional setup, no replication loops are present.
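
The Driver and Service conditions above can be expressed as a small consistency check. The following is a minimal sketch, not part of Streams Replication Manager: the `Replication` class, the `pull_mode_violations` function, and the cluster aliases are hypothetical names used for illustration only, and the check does not cover the replication policy or replication loop conditions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replication:
    source: str       # alias of the source Kafka cluster
    target: str       # alias of the target Kafka cluster
    executed_by: str  # alias of the cluster whose co-located Driver runs it


def pull_mode_violations(replications, srm_clusters, service_targets):
    """Return human-readable pull mode violations (empty list if none).

    srm_clusters: aliases of clusters that host an SRM service (hypothetical model).
    service_targets: maps the cluster hosting a Service role to the cluster it targets.
    """
    problems = []
    for r in replications:
        flow = f"{r.source}->{r.target}"
        # Each target Kafka cluster needs a co-located SRM service.
        if r.target not in srm_clusters:
            problems.append(f"{flow}: no SRM service co-located with target {r.target}")
        # Drivers only execute replications that target their co-located cluster.
        if r.executed_by != r.target:
            problems.append(f"{flow}: executed by Driver {r.executed_by}, expected Driver {r.target}")
    for host, target in service_targets.items():
        # Service roles only target their co-located cluster.
        if host != target:
            problems.append(f"Service on {host} targets {target} instead of {host}")
    return problems


srm_clusters = {"A", "B", "C"}
service_targets = {"A": "A", "B": "B", "C": "C"}

pull = [Replication("B", "A", "A"), Replication("C", "A", "A")]
push = [Replication("B", "A", "B")]  # Driver B pushes into remote Cluster A

print(pull_mode_violations(pull, srm_clusters, service_targets))
print(pull_mode_violations(push, srm_clusters, service_targets))
```

The first call reports no violations, while the second flags the replication that is executed by a Driver that is not co-located with the target cluster.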

Pull mode is recommended because it was found to be the deployment type that is most resilient to various timeout and network instability issues. For example:

Streams Replication Manager Driver roles run Connect workers that coordinate through the target cluster of the replication flow. This means that Connect workers are more closely tied to the target cluster than to the source cluster. Locating the Streams Replication Manager Drivers close to the target Kafka cluster minimizes group membership and rebalance-related timeout issues. It also minimizes network instability on the producer side of the Streams Replication Manager Drivers, which reduces data duplication in the target cluster.
Streams Replication Manager Service roles also coordinate through the target cluster. This makes them more sensitive to timeout and network partition issues tied to the target cluster. In addition, the Streams Replication Manager Service reads from and writes to the target Kafka cluster through the Kafka Streams application. Locating the Streams Replication Manager Services close to the target Kafka clusters can minimize timeout and network partition issues. Additionally, for Kafka clusters in the cloud, hosting the Streams Replication Manager Service in the same data center as the Kafka cluster can help keep cloud costs to a minimum.
In a situation where there is a network partition, or one of the clusters in the replication is unavailable, it is preferable to let the target cluster pull the data when the connection is established, rather than the source cluster trying to push data indefinitely.

Pull mode deployment example

Consider a simple deployment that has three clusters: Cluster A, Cluster B, and Cluster C. Each of them hosts a Kafka cluster.

To achieve a pull mode deployment, you must deploy the Streams Replication Manager service (both Driver and Service roles) on all three clusters. The Driver and Service roles must target their co-located Kafka cluster.

To be more precise, in the case of Driver targets, you must ensure that each source cluster is targeted by at least one Driver. This is required so that a heartbeats topic is created in that cluster. This requirement is met in this example because each Kafka cluster has a co-located Driver targeting it. However, if you have a unidirectional replication setup where the source Kafka cluster does not have a co-located Streams Replication Manager service, you must ensure that one of your Drivers targets that Kafka cluster.
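
The heartbeats requirement can be checked with a simple set difference. This is a minimal sketch under the assumption that replications are modeled as (source, target) alias pairs; the function name `clusters_missing_heartbeats` and the data model are hypothetical and are not part of Streams Replication Manager.

```python
def clusters_missing_heartbeats(replications, driver_targets):
    """Return source clusters that no Driver targets.

    replications: iterable of (source, target) cluster alias pairs.
    driver_targets: aliases of all clusters targeted by at least one Driver.
    """
    sources = {source for source, _ in replications}
    # A source cluster that no Driver targets would not get a heartbeats topic.
    return sorted(sources - set(driver_targets))


# Three-cluster example: every Kafka cluster has a co-located Driver targeting it.
bidirectional_ready = clusters_missing_heartbeats(
    [("B", "A"), ("C", "A")], {"A", "B", "C"}
)

# Unidirectional setup: Cluster B has no co-located service and no Driver targets it.
unidirectional_gap = clusters_missing_heartbeats([("B", "A")], {"A"})

# Same setup after one of the Drivers is also configured to target Cluster B.
unidirectional_fixed = clusters_missing_heartbeats([("B", "A")], {"A", "B"})

print(bidirectional_ready, unidirectional_gap, unidirectional_fixed)
```

In the unidirectional case, the check returns Cluster B until one of the Drivers is configured to target it as well.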

Once setup is complete, you can start configuring your replications. For example, assume that you want to have data replicated from Cluster B and Cluster C to Cluster A. In this case, you need to enable two separate replications, B->A and C->A. To achieve pull mode, both replications must be executed by Streams Replication Manager Driver A.

Any number of replications can be set up between the clusters, but you must always ensure that each Driver only executes the replications that target its co-located cluster. For example, assume that in addition to the replications set up previously, you also want to set up replication to Cluster B from Cluster A and Cluster C. In this case, the deployment would change as follows: Driver A continues to execute B->A and C->A, while the two new replications, A->B and C->B, are executed by Driver B.
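
The rule that each Driver executes only the replications targeting its co-located cluster amounts to grouping replications by their target. The following minimal sketch illustrates this for the example; `assign_to_drivers` is a hypothetical helper, not a Streams Replication Manager API, and it assumes replications are written in the `source->target` form used above.

```python
from collections import defaultdict


def assign_to_drivers(flows):
    """Group 'source->target' replications by the Driver that should execute them.

    In pull mode, the executing Driver is the one co-located with the target cluster.
    """
    plan = defaultdict(list)
    for flow in flows:
        source, target = (part.strip() for part in flow.split("->"))
        plan[target].append(f"{source}->{target}")
    return dict(plan)


# Replications from the example: the initial two, plus the two targeting Cluster B.
plan = assign_to_drivers(["B->A", "C->A", "A->B", "C->B"])

for driver, flows in sorted(plan.items()):
    print(f"Driver {driver} executes: {', '.join(flows)}")
```

Driver C does not appear in the plan because no replication in this example targets Cluster C.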
