Using Apache Storm

Chapter 5. Ingesting Data with the Apache Kafka Spout Connector

Apache Kafka is a high-throughput, distributed messaging system. Apache Storm provides a Kafka spout to facilitate ingesting data from Kafka 0.8.x brokers. Storm developers should include downstream bolts in their topologies to process data ingested with the Kafka spout.

The storm-kafka components include a core-storm spout, as well as a fully transactional Trident spout. Storm-kafka spouts provide the following key features:

  • 'Exactly once' tuple processing with the Trident API

  • Dynamic discovery of Kafka brokers and partitions

Hortonworks recommends that Storm developers use the Trident API. However, use the core-storm API if sub-second latency is critical for your Storm topology. The two APIs represent the spout differently:

  • The core-storm API represents a Kafka spout with the KafkaSpout class.

  • The Trident API provides an OpaqueTridentKafkaSpout class to represent the spout.

To initialize KafkaSpout and OpaqueTridentKafkaSpout, Storm developers need an instance of a subclass of the KafkaConfig class, which represents configuration information needed to ingest data from a Kafka cluster.

  • The KafkaSpout constructor requires an instance of the SpoutConfig subclass.

  • The OpaqueTridentKafkaSpout constructor requires an instance of the TridentKafkaConfig subclass.
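The relationship between the two spout classes and their configuration subclasses can be sketched as follows. This is an illustrative sketch, not a complete topology: the ZooKeeper connect string, topic name, ZooKeeper root path, and client ID are placeholders, and package names differ across versions (releases prior to Storm 1.0 use the `storm.kafka` package rather than `org.apache.storm.kafka`).

```java
import org.apache.storm.kafka.BrokerHosts;
import org.apache.storm.kafka.KafkaSpout;
import org.apache.storm.kafka.SpoutConfig;
import org.apache.storm.kafka.StringScheme;
import org.apache.storm.kafka.ZkHosts;
import org.apache.storm.kafka.trident.OpaqueTridentKafkaSpout;
import org.apache.storm.kafka.trident.TridentKafkaConfig;
import org.apache.storm.spout.SchemeAsMultiScheme;

public class KafkaSpoutSketch {
    public static void main(String[] args) {
        // Both config subclasses take a BrokerHosts implementation;
        // ZkHosts discovers brokers and partitions via ZooKeeper.
        BrokerHosts hosts = new ZkHosts("zkhost1:2181");  // placeholder connect string

        // Core-storm API: SpoutConfig(hosts, topic, zkRoot, clientId)
        SpoutConfig spoutConfig =
                new SpoutConfig(hosts, "my-topic", "/kafka-spout", "my-client-id");
        spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());
        KafkaSpout kafkaSpout = new KafkaSpout(spoutConfig);

        // Trident API: TridentKafkaConfig(hosts, topic)
        TridentKafkaConfig tridentConfig = new TridentKafkaConfig(hosts, "my-topic");
        tridentConfig.scheme = new SchemeAsMultiScheme(new StringScheme());
        OpaqueTridentKafkaSpout tridentSpout = new OpaqueTridentKafkaSpout(tridentConfig);
    }
}
```

Either spout would then be added to a topology in the usual way, for example with `TopologyBuilder.setSpout` for the core-storm API or `TridentTopology.newStream` for the Trident API.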

In turn, the constructors for both KafkaSpout and OpaqueTridentKafkaSpout require an implementation of the BrokerHosts interface, which is used to map Kafka brokers to topic partitions. The storm-kafka component provides two implementations of BrokerHosts: ZkHosts and StaticHosts.

  • Use the ZkHosts implementation to dynamically track broker-to-partition mapping.

  • Use the StaticHosts implementation for static broker-to-partition mapping.
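The two BrokerHosts implementations can be constructed as sketched below. Hostnames, ports, and the topic name are placeholders, and constructor details vary across storm-kafka versions (for example, older releases use the `storm.kafka` package, and the GlobalPartitionInformation constructor has changed over time), so check the API for your release.

```java
import org.apache.storm.kafka.Broker;
import org.apache.storm.kafka.BrokerHosts;
import org.apache.storm.kafka.StaticHosts;
import org.apache.storm.kafka.ZkHosts;
import org.apache.storm.kafka.trident.GlobalPartitionInformation;

public class BrokerHostsSketch {
    public static void main(String[] args) {
        // Dynamic mapping: track broker-to-partition assignments through
        // the ZooKeeper ensemble that the Kafka cluster registers with.
        BrokerHosts dynamic = new ZkHosts("zkhost1:2181,zkhost2:2181");

        // Static mapping: declare the broker-to-partition assignments by
        // hand (here, partition 0 of the topic lives on a single broker).
        GlobalPartitionInformation partitions =
                new GlobalPartitionInformation("my-topic");
        partitions.addPartition(0, new Broker("kafkahost1", 9092));
        BrokerHosts fixed = new StaticHosts(partitions);
    }
}
```

With ZkHosts, the spout periodically refreshes the mapping, so broker or partition changes are picked up automatically; with StaticHosts, the mapping is fixed for the life of the topology.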