Configuring SRM Driver for performance tuning
Learn about the configuration options available for tuning the performance of SRM Driver.
To tune the performance of SRM Driver, you must understand the task architecture of SRM. For more information, see Streams Replication Manager Architecture.
It is also important that you are familiar with the task load-balancing method that SRM uses. For more information, see Task architecture and load-balancing.
- Hardware
- Task count
- Consumer and producer configurations
- Connect worker configurations
- Metric calculation
Hardware
Learn about the hardware-related aspects of SRM performance tuning, such as memory requirements.
For high workloads, SRM Drivers need to run on dedicated nodes. SRM Drivers are typically network-bound, but depending on how they are configured, memory consumption can also be high. For small workloads, a maximum heap size of 8 GB is sufficient; for larger workloads, Cloudera recommends a higher maximum heap size. SRM Driver performance can degrade significantly when the driver is limited by memory or by garbage collection pauses. Always make sure that SRM Drivers have enough memory. You can define the heap size using the SRM_HEAP_OPTS configuration variable.
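For example, a minimal sketch of a heap setting, assuming SRM_HEAP_OPTS accepts standard JVM options and that an 8 GB heap suits your workload; adjust the sizes based on the guidance above:

# Fixed 8 GB heap for the SRM Driver JVM; increase for larger workloads
SRM_HEAP_OPTS="-Xmx8g -Xms8g"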
Task count
Learn about configuring optimal task counts in relation to the number of topic partitions, SRM Drivers, and SRM Driver nodes.
The task count specifies the level of parallelism of the data replication. One task corresponds to approximately one worker thread in the SRM Driver cluster. Each task runs its own consumer and producer instance, so each additional task means additional connections to the source and target Kafka clusters.
To increase the parallelism of the work, increase the Tasks Max property in Cloudera Manager. Because the unit of work is the source topic-partition, there is no benefit in setting Tasks Max higher than the number of replicated topic-partitions; the extra tasks have no effect on the throughput of the system.
As a baseline, when SRM Drivers run on dedicated nodes, Cloudera recommends setting Tasks Max to approximately the number of SRM Drivers multiplied by the number of cores on each SRM Driver node.
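For example, assuming a deployment of 3 SRM Drivers on dedicated 16-core nodes (illustrative numbers only), the baseline works out to Tasks Max ≈ 3 drivers x 16 cores = 48. If the replication covered only 30 topic-partitions, a value of 30 would be sufficient, because additional tasks would not increase throughput.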
Consumer and producer configurations
Learn about consumer and producer properties you can tweak to increase throughput and decrease overhead.
A MirrorSourceTask manages one consumer and one producer to drive the data replication. As with custom Kafka client applications, you need to tweak these consumer and producer instances when you expect high throughput on the replication flow. The following list is only an overview of the properties that are typically tweaked for the consumer and the producer, and how they apply to SRM. For some workloads, you might also need to tweak the broker configuration.
Consumer configurations
To override the configuration of the consumer used by MirrorSourceTask, use the cluster->cluster.consumer. prefix. You can tweak the following consumer properties to increase throughput. An example configuration follows this list.
- fetch.max.bytes
- The maximum size of a fetch response. You can increase this to increase the consumer throughput and reduce the overhead of fetching.
- max.poll.records
- The maximum number of records returned in a fetch response. You can increase this to increase the consumer throughput and reduce the overhead of fetching.
- receive.buffer.bytes
- The size of the TCP receive buffer used by the consumer. When the average response size increases due to tweaking the previous configurations, you also need to increase the receive buffer bytes to reduce the overhead of buffering.
- max.partition.fetch.bytes
- The maximum amount of data returned for one topic partition in a fetch response. In some cases, where there are source topics with high ingestion rate to be replicated, you need to increase this configuration to allow the consumer to keep up with the ingestion rate. If the source topics have a similar ingestion rate, you do not need to change this.
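A minimal sketch of such overrides, assuming a replication from a source cluster aliased primary to a target cluster aliased secondary; the aliases and values are placeholders, not tuned recommendations:

# Larger fetches and more records per poll to raise consumer throughput
primary->secondary.consumer.fetch.max.bytes=104857600
primary->secondary.consumer.max.poll.records=1000
# Larger TCP receive buffer to match the bigger responses
primary->secondary.consumer.receive.buffer.bytes=8388608
# Higher per-partition limit for source topics with a high ingestion rate
primary->secondary.consumer.max.partition.fetch.bytes=10485760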
Producer configurations
To override the configuration of the producer used by MirrorSourceTask, use the cluster->cluster.producer.override. prefix. You can tweak the following producer properties to increase throughput. An example configuration follows this list.
- Producer Batch Size
- The maximum batch size created by the producer. Batches are created for each topic partition. You can increase this to increase the producer throughput and reduce the overhead of producing. When tweaking this property, make sure to check the value of max.request.size, as well as message.max.bytes on the broker side.
- send.buffer.bytes
- The size of the TCP send buffer used by the producer. When the average request size increases due to tweaking the previous configurations, you also need to increase the send buffer bytes to reduce the overhead of buffering.
- Producer Compression Type
- The compression type to use in the producer. You can use producer side compression to achieve higher throughput.
- max.request.size
- The maximum request size sent by the producer. The batch size cannot exceed this configuration. When tweaking this property, make sure to check the value of Producer Batch Size, as well as message.max.bytes on the broker side.
- Producer Buffer Memory
- The maximum amount of memory the producer can use for buffering records. For high throughput replications, heavy batching is necessary on the producer side. To allow heavy batching, you need to increase the buffer memory.
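A minimal sketch of such overrides, again assuming clusters aliased primary and secondary; Producer Batch Size, Producer Compression Type, and Producer Buffer Memory correspond to the Kafka producer properties batch.size, compression.type, and buffer.memory, and the values below are placeholders, not tuned recommendations:

# Heavier batching; keep max.request.size and the broker's message.max.bytes in sync
primary->secondary.producer.override.batch.size=524288
primary->secondary.producer.override.max.request.size=10485760
# Larger TCP send buffer to match the bigger requests
primary->secondary.producer.override.send.buffer.bytes=8388608
# Producer-side compression for higher throughput
primary->secondary.producer.override.compression.type=lz4
# More buffer memory to allow heavy batching
primary->secondary.producer.override.buffer.memory=134217728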
Connect worker configurations
Learn about the Connect worker properties you can change to facilitate heavy loads on SRM.
- offset.flush.timeout.ms
- Affects the timeout when waiting for in-flight produce requests to finish before committing the source offsets. The default value is five seconds, which is a low timeout. It might be exceeded when there is a high load on a single task. You can safely increase this value, as the timeout does not affect the throughput of the actual flow, and data replication continues even when the Connect framework is waiting for a safe source offset commit.
- Offset Flush Interval
- Controls the interval of the source offset flush. The offset flush does not have a significant impact on the data replication throughput. A possible reason to tweak this configuration is when the average offset flush duration approaches the interval; in that case, increasing the interval might reduce unnecessary work. An example of both worker properties follows this list.
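A minimal sketch of these worker properties, assuming they are supplied to the SRM Driver's Connect workers (for example through Cloudera Manager); the values are illustrative:

# Allow more time for in-flight produce requests before the source offset commit times out (default is 5000 ms)
offset.flush.timeout.ms=60000
# Flush source offsets less often if the average flush duration approaches the interval
offset.flush.interval.ms=120000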
Metric calculation
The calculation of the following metric might impact performance.
replication-records-lag
You can control the performance impact of the replication-records-lag metric by tweaking the following properties in Streams Replication Manager's Replication Configs in Cloudera Manager. An example configuration follows this list.
- replication.records.lag.calc.enabled
- Controls whether the replication-records-lag metric is calculated. This metric provides information regarding the replication lag based on offsets. The metric is available both on the cluster and the topic level. The calculation of this metric might add latency to replications and impact SRM performance. If you are experiencing performance issues, you can try setting this property to false to disable the calculation of replication-records-lag. Alternatively, you can try fine-tuning how SRM calculates replication-records-lag with the replication.records.lag.calc.period.ms and replication.records.lag.end.offset.timeout.ms properties.
- replication.records.lag.calc.period.ms
- Controls how frequently SRM calculates the replication-records-lag metric. The default value of 0 means that the metric is calculated continuously. Cloudera recommends configuring this property to 15000 ms (15 seconds) or higher if you are experiencing issues related to the calculation of replication-records-lag. A calculation frequency of 15 seconds or more results in the metric being available for consumption without any significant impact on SRM performance.
- replication.records.lag.end.offset.timeout.ms
- Specifies the Kafka end offset timeout value used for calculating the replication-records-lag metric. Setting this property to a value lower than the default 60000 ms (1 minute) might reduce the latency of calculating replication-records-lag; however, the calculation might fail. A value higher than the default can help avoid metric calculation failures, but might increase replication latency and lower SRM performance.
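A minimal sketch of these properties in Streams Replication Manager's Replication Configs; the values follow the guidance above and are illustrative only:

# Keep the metric enabled but calculate it every 15 seconds instead of continuously
replication.records.lag.calc.enabled=true
replication.records.lag.calc.period.ms=15000
# Placeholder value: raise the end offset timeout to reduce calculation failures
replication.records.lag.end.offset.timeout.ms=90000
# Or disable the calculation entirely if performance issues persist
# replication.records.lag.calc.enabled=false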