Chapter 5. Ingesting Data with the Apache Kafka Spout Connector
Apache Kafka is a high-throughput, distributed messaging system. Apache Storm provides a Kafka spout to facilitate ingesting data from Kafka 0.8x brokers. Storm developers should include downstream bolts in their topologies to process data ingested with the Kafka spout.
The storm-kafka components include a core-storm spout, as well as a fully transactional Trident spout. Storm-kafka spouts provide the following key features:
- 'Exactly once' tuple processing with the Trident API
- Dynamic discovery of Kafka brokers and partitions
Hortonworks recommends that Storm developers use the Trident API. However, use the core-storm API if sub-second latency is critical for your Storm topology.
The core-storm API represents a Kafka spout with the KafkaSpout class. The Trident API provides an OpaqueTridentKafkaSpout class to represent the spout.
To initialize KafkaSpout and OpaqueTridentKafkaSpout, Storm developers need an instance of a subclass of the KafkaConfig class, which represents the configuration information needed to ingest data from a Kafka cluster.
The KafkaSpout constructor requires the SpoutConfig subclass. The OpaqueTridentKafkaSpout constructor requires the TridentKafkaConfig subclass.
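Wiring this together, a minimal sketch of constructing both spouts might look like the following. The ZooKeeper connection string, topic name, ZooKeeper root path, and spout id are hypothetical placeholders, and the imports assume the pre-1.0 storm-kafka package layout (storm.kafka.* and backtype.storm.*) used with Kafka 0.8.x brokers:

```java
import backtype.storm.spout.SchemeAsMultiScheme;
import storm.kafka.BrokerHosts;
import storm.kafka.KafkaSpout;
import storm.kafka.SpoutConfig;
import storm.kafka.StringScheme;
import storm.kafka.ZkHosts;
import storm.kafka.trident.OpaqueTridentKafkaSpout;
import storm.kafka.trident.TridentKafkaConfig;

public class KafkaSpoutExample {
    public static void main(String[] args) {
        // A BrokerHosts implementation is required by both configs;
        // the ZooKeeper connection string here is a placeholder
        BrokerHosts hosts = new ZkHosts("zkhost1:2181,zkhost2:2181");

        // Core-storm API: SpoutConfig(hosts, topic, zkRoot, spout id)
        SpoutConfig spoutConfig =
                new SpoutConfig(hosts, "test-topic", "/kafka-spout", "my-spout-id");
        spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());
        KafkaSpout kafkaSpout = new KafkaSpout(spoutConfig);

        // Trident API: TridentKafkaConfig(hosts, topic)
        TridentKafkaConfig tridentConfig =
                new TridentKafkaConfig(hosts, "test-topic");
        tridentConfig.scheme = new SchemeAsMultiScheme(new StringScheme());
        OpaqueTridentKafkaSpout tridentSpout =
                new OpaqueTridentKafkaSpout(tridentConfig);
    }
}
```

The scheme field controls how raw Kafka messages are deserialized into tuples; StringScheme, shown here, emits each message as a single string field.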
In turn, the constructors for both KafkaSpout and OpaqueTridentKafkaSpout require an implementation of the BrokerHosts interface, which is used to map Kafka brokers to topic partitions. The storm-kafka component provides two implementations of BrokerHosts: ZkHosts and StaticHosts.
Use the ZkHosts implementation to dynamically track broker-to-partition mapping. Use the StaticHosts implementation for static broker-to-partition mapping.
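The difference can be sketched as follows. Host names and ports are hypothetical, and the imports again assume the pre-1.0 storm-kafka package layout:

```java
import storm.kafka.Broker;
import storm.kafka.BrokerHosts;
import storm.kafka.StaticHosts;
import storm.kafka.ZkHosts;
import storm.kafka.trident.GlobalPartitionInformation;

public class BrokerHostsExample {
    public static void main(String[] args) {
        // ZkHosts: discovers brokers and partitions from Kafka's own
        // ZooKeeper metadata, so the mapping tracks cluster changes
        BrokerHosts zkHosts = new ZkHosts("zkhost1:2181,zkhost2:2181");

        // StaticHosts: the broker-to-partition mapping is fixed when the
        // topology is submitted and never refreshed
        GlobalPartitionInformation partitions = new GlobalPartitionInformation();
        partitions.addPartition(0, new Broker("kafkahost1", 9092));
        partitions.addPartition(1, new Broker("kafkahost2", 9092));
        BrokerHosts staticHosts = new StaticHosts(partitions);
    }
}
```

Because StaticHosts never re-reads partition metadata, it is generally suited only to testing or clusters whose topology does not change; ZkHosts is the common choice in production.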