Tuning KafkaSpout Performance
KafkaSpout provides two internal parameters to control performance:
offset.commit.period.ms specifies the period of time (in milliseconds) after which the spout commits to Kafka. To set this parameter, use the KafkaSpoutConfig set method setOffsetCommitPeriodMs.
max.uncommitted.offsets defines the maximum number of polled offsets (records) that can be pending commit before another poll can take place. When this limit is reached, no more offsets can be polled until the next successful commit brings the number of pending offsets below the threshold. To set this parameter, use the KafkaSpoutConfig set method setMaxUncommittedOffsets.
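Both values are applied through the KafkaSpoutConfig builder. A minimal sketch, assuming a hypothetical broker address (kafka:9092) and topic name (my-topic), with illustrative values:

```java
import org.apache.storm.kafka.spout.KafkaSpoutConfig;

// Hypothetical broker and topic, for illustration only.
KafkaSpoutConfig<String, String> spoutConfig =
    KafkaSpoutConfig.builder("kafka:9092", "my-topic")
        // Commit offsets to Kafka every 10 seconds.
        .setOffsetCommitPeriodMs(10_000)
        // Allow at most 250,000 polled offsets to be pending commit.
        .setMaxUncommittedOffsets(250_000)
        .build();
```

The resulting config object is passed to the KafkaSpout constructor when you build the topology.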
Note that these two parameters trade off memory versus time:
When offset.commit.period.ms is set to a low value, the spout commits to Kafka more often. While the spout is committing to Kafka, it is not fetching new records or processing new tuples.
When max.uncommitted.offsets increases, the memory footprint increases. Each offset uses eight bytes of memory, so a value of 10,000,000 (ten million offsets) uses about 80 MB of memory.
It is possible to achieve good performance with a low commit period and a small memory footprint (a small value for max.uncommitted.offsets), as well as with a larger commit period and a larger memory footprint. However, you should avoid combining a large value for offset.commit.period.ms with a low value for max.uncommitted.offsets.
Kafka consumer configuration parameters can also affect KafkaSpout performance. The following Kafka parameters are likely to have the strongest impact:
The Kafka Consumer poll timeout specifies the time (in milliseconds) spent polling if data is not available. To set this parameter, use the KafkaSpoutConfig set method setPollTimeoutMs.
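The poll timeout is also set on the KafkaSpoutConfig builder. A sketch, assuming the same hypothetical broker and topic; the 200 ms value is illustrative:

```java
import org.apache.storm.kafka.spout.KafkaSpoutConfig;

// Hypothetical broker and topic, for illustration only.
KafkaSpoutConfig<String, String> spoutConfig =
    KafkaSpoutConfig.builder("kafka:9092", "my-topic")
        // Wait at most 200 ms for records on each poll of Kafka.
        .setPollTimeoutMs(200)
        .build();
```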
The Kafka consumer parameter fetch.min.bytes specifies the minimum amount of data the server returns for a fetch request. If the minimum amount is not available, the request waits until the minimum amount accumulates before the server answers the request.
The Kafka consumer parameter fetch.max.wait.ms specifies the maximum amount of time the server will wait before answering a fetch request when there is not sufficient data to satisfy fetch.min.bytes.
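Plain Kafka consumer parameters like these can be passed through to the underlying consumer. A sketch, assuming a storm-kafka-client version whose builder exposes the setProp method; broker, topic, and fetch values are illustrative:

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.storm.kafka.spout.KafkaSpoutConfig;

// Hypothetical broker and topic; the fetch values are illustrative.
KafkaSpoutConfig<String, String> spoutConfig =
    KafkaSpoutConfig.builder("kafka:9092", "my-topic")
        // Ask the broker to accumulate at least 1 KB per fetch response...
        .setProp(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, 1024)
        // ...but to wait no longer than 500 ms before answering anyway.
        .setProp(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, 500)
        .build();
```

Raising fetch.min.bytes batches more records per network round trip at the cost of added latency, which is the same memory-versus-time trade-off described above.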
Important: For clusters in production use, you should override the default values of the KafkaSpout parameters described above.
Performance also depends on the structure of your Kafka cluster, the distribution of the data, and the availability of data to poll.
Kafka Spouts Performance Comparison
The new KafkaSpout implementation outperforms the older KafkaSpout implementation, with reduced latency and increased throughput.
Log Level Performance Impact
Storm supports several logging levels, including Trace, Debug, Info, Warn, and Error. Trace-level logging has a significant impact on performance and should be avoided in production. The number of log messages is proportional to the number of records fetched from Kafka, so a large volume of messages is printed when Trace-level logging is enabled.
Trace-level logging is most useful for debugging pre-production environments under mild load. For debugging, if necessary, you can throttle how many messages are polled from Kafka by setting the max.partition.fetch.bytes parameter to a low number that is larger than the largest single message stored in Kafka.
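This throttling can be configured the same way as the other consumer parameters. A sketch, assuming the hypothetical broker and topic used above; the 1 MB cap is illustrative and should be chosen just above your largest message size:

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.storm.kafka.spout.KafkaSpoutConfig;

// Hypothetical broker and topic; the 1 MB cap is illustrative.
KafkaSpoutConfig<String, String> spoutConfig =
    KafkaSpoutConfig.builder("kafka:9092", "my-topic")
        // Fetch at most 1 MB per partition per poll, limiting how many
        // records (and therefore Trace-level log lines) arrive at once.
        .setProp(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, 1_048_576)
        .build();
```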
Debug-level logging has a slightly smaller performance impact than Trace-level logging, but it still generates many messages. Debug level can be useful for assessing whether the Kafka spout is properly tuned.
For general information about Apache Storm logging features, see Monitoring and Debugging an Apache Storm Topology.