Configuring SRM Driver for performance tuning

Learn about the configuration options available for tuning the performance of SRM Driver.

To perform performance tuning for SRM Driver, it is important that you understand the task architecture of SRM. For more information, see Streams Replication Manager Architecture.

It is also important that you are familiar with the task load balancing method used by SRM. For more information, see Task architecture and load-balancing.

You need to tune the following main groups of configurations for SRM Driver when running a replication flow with a heavy load:

Hardware
Task count
Consumer and producer configurations of MirrorSourceTask
Connect worker configurations

Hardware

Learn about the hardware-related aspects of SRM performance tuning, such as memory requirements.

For high workloads, SRM Drivers need to run on dedicated nodes. SRM Drivers are typically network-bound, but due to configuration tweaks, memory consumption can also be high.

For small workloads, a maximum heap size of 8 GB is sufficient, however, for larger workloads, Cloudera recommends a higher maximum heap size. SRM Driver performance can significantly degrade when limited by memory or due to garbage collection pauses. Always make sure that SRM Drivers have enough memory. You can define the heap size using the SRM_HEAP_OPTS configuration variable.

Task count

Learn about configuring optimal task counts in relation to the number of topic partitions, SRM Drivers, and SRM Driver nodes.

The task count specifies the level of parallelism of the data replication. One task corresponds to (approximately) one worker thread on the SRM Driver cluster. Each task runs a consumer and a producer instance, meaning more connections to the source and target Kafka clusters.

To increase the parallelism of the work, you can increase the Tasks Max property in Cloudera Manager. Since the unit of work is the source topic-partition, there is no need for the value of Tasks Max to exceed the number of replicated topic-partitions, as it has no effect on the throughput of the system.

As a baseline recommendation, When SRM Drivers are running on dedicated nodes Cloudera recommends setting the value of Tasks Max approximately to the number of SRM Drivers times the number of SRM Driver node cores.

Consumer and producer configurations

Learn about consumer and producer properties you can tweak to increase throughput and decrease overhead.

A MirrorSourceTask manages one consumer and one producer to drive the data replication. Similarly to custom Kafka client applications, you need to tweak these consumer and producer instances when expecting a high throughput on the replication flow. The following list is only an overview of properties that are typically tweaked for the consumer and the producer, as well as their application for SRM. On some workloads, you might also need to tweak broker configuration.

Consumer configurations

To configure the consumer inside the MirrorSourceTask, use the cluster->cluster.consumer. prefix. You can tweak the following consumer properties to increase throughput:

fetch.max.bytes: The maximum size of a fetch response. You can increase this to increase the consumer throughput and reduce the overhead of fetching.
max.poll.records: The maximum number of records returned in a fetch response. You can increase this to increase the consumer throughput and reduce the overhead of fetching.
receive.buffer.bytes: The size of the TCP receive buffer used by the consumer. When the average response size increases due to tweaking the previous configurations, you also need to increase the receive buffer bytes to reduce the overhead of buffering.
max.partition.fetch.bytes: The maximum amount of data returned for one topic partition in a fetch response. In some cases, where there are source topics with high ingestion rate to be replicated, you need to increase this configuration to allow the consumer to keep up with the ingestion rate. If the source topics have a similar ingestion rate, you do not need to change this.

Producer configurations

To configure the producer inside the MirrorSourceTask, use the cluster->cluster.producer.override. prefix. You can tweak the following producer properties to increase throughput:

Producer Batch Size: The maximum batch size created by the producer. Batches are created for each topic partition. You can increase this to increase the producer throughput and reduce the overhead of producing. When tweaking this property, make sure to check the value of max.request.size, as well as message.max.bytes on the broker side.
send.buffer.bytes: The size of the TCP send buffer used by the producer. When the average request size increases due to tweaking the previous configurations, you also need to increase the send buffer bytes to reduce the overhead of buffering.
Producer Compression Type: The compression type to use in the producer. You can use producer side compression to achieve higher throughput.
max.request.size: The maximum request size sent by the producer. The batch size cannot exceed this configuration. When tweaking this property, make sure to check the value of Producer Batch Size, as well as message.max.bytes on the broker side.
Producer Buffer Memory: The maximum amount of memory the producer can use for buffering records. For high throughput replications, heavy batching is necessary on the producer side. To allow heavy batching, you need to increase the buffer memory.

Connect worker configurations

Learn about the Connect worker properties you can change to facilitate heavy loads on SRM.

The Connect worker does not significantly influence the throughput of SRM. However, when running a heavy load with SRM, you might need to tweak the following properties.

offset.flush.timeout.ms: Affects the timeout when waiting for in-flight produce requests to finish before committing the source offsets. The default value is five seconds, which is a low timeout. It might be exceeded when there is a high load on a single task. You can safely increase this value, as the timeout does not affect the throughput of the actual flow, and data replication continues even when the Connect framework is waiting for a safe source offset commit.
tip
Consider changing this when the ERROR level message Failed to flush, timed out while waiting for producer to flush outstanding X messages appears in the SRM Driver logs .
Offset Flush Interval: Controls the interval of the source offset flush. The offset flush does not have a significant impact in the data replication throughput. A possible reason to tweak this configuration is the average offset flush duration getting close to the interval. In such a case, increasing the interval might reduce unnecessary work.
tip
Consider changing this when the average flush duration seems to be close to the flush interval, based on the DEBUG level message Finished offset commitOffsets successfully in X ms in the SRM Driver logs.

Metric calculation

The calculation of the following metrics might impact performance.

replication-records-lag

You can tweak the performance impact of the replication-records-lag metric by tweaking the values of the following properties in Streams Replication Manager's Replication Configs in Cloudera Manager.

replication.records.lag.calc.enabled: Controls whether the replication-records-lag metric is calculated. This metric provides information regarding the replication lag based on offsets. The metric is available both on the cluster and the topic level. The calculation of this metric might add latency to replications and impact SRM performance. If you are experiencing performance issues, you can try setting this property to false to disable the calculation of replication-records-lag. Alternatively, you can try fine-tuning how SRM calculates replication-records-lag with the replication.records.lag.calc.period.ms and replication.records.lag.end.offset.timeout.ms properties.
replication.records.lag.calc.period.ms: Controls how frequently SRM calculates the replication-records-lag metric. The default value of 0 means that the metric is calculated continuously. Cloudera recommends configuring this property to 15000 ms (15 seconds) or higher if you are experiencing issues related to the calculation of replication-records-lag. A calculation frequency of 15 seconds or more results in the metric being available for consumption without any significant impact on SRM performance.
replication.records.lag.end.offset.timeout.ms: Specifies the Kafka end offset timeout value used for calculating the replication-records-lag metric. Setting this property to a lower value than the default 6000 ms (1 minute) might reduce latency in calculating replication-records-lag, however, replication-records-lag calculation might fail. A value higher than the default can help avoid metric calculation failures, but might increase replication latency and lower SRM performance.