Using Kafka with Spark Streaming

For information on how to configure Spark Streaming to receive data from Kafka, see the Spark Streaming + Kafka Integration Guide.

In CDH 5.7 and higher, the Spark connector to Kafka only works with Kafka 2.0 and higher.

Validating Kafka Integration with Spark Streaming

To validate your Kafka integration with Spark Streaming, run the KafkaWordCount example.

If you installed Spark using parcels, use the following command:
/opt/cloudera/parcels/CDH/lib/spark/bin/run-example streaming.KafkaWordCount <zkQuorum> <group> <topics> <numThreads>

If you installed Spark using packages, use the following command:

/usr/lib/spark/bin/run-example streaming.KafkaWordCount <zkQuorum> <group> <topics> <numThreads>

Replace the variables as follows:
  • <zkQuorum> - ZooKeeper quorum URI used by Kafka (for example, zk01.example.com:2181,zk02.example.com:2181,zk03.example.com:2181).
  • <group> - Consumer group used by the application.
  • <topics> - Comma-separated list of Kafka topics containing the data for the application.
  • <numThreads> - Number of consumer threads reading the data. If this is higher than the number of partitions in the Kafka topic, some threads will be idle.
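
As an illustration of how the variables fit together, the following is a minimal sketch that assembles the parcel-installation command from placeholder values and prints it for review without running it. The group name, topic name, thread count, and partition count are hypothetical values chosen for this example, not values from your cluster. The sketch also shows the idle-thread arithmetic described for <numThreads>, assuming the example reads a single topic.

```shell
#!/bin/sh
# All values below are hypothetical placeholders; substitute your own.
ZK_QUORUM="zk01.example.com:2181,zk02.example.com:2181,zk03.example.com:2181"
GROUP="wordcount-group"   # hypothetical consumer group name
TOPICS="wordcount-input"  # hypothetical topic name
NUM_THREADS=4             # consumer threads requested
PARTITIONS=3              # assumed partition count of the topic

# Threads beyond the partition count have no partition to read from.
if [ "$NUM_THREADS" -gt "$PARTITIONS" ]; then
  IDLE_THREADS=$((NUM_THREADS - PARTITIONS))
else
  IDLE_THREADS=0
fi
echo "Idle consumer threads: $IDLE_THREADS"

# Parcel installation path; for packages use /usr/lib/spark/bin/run-example.
RUN_EXAMPLE="/opt/cloudera/parcels/CDH/lib/spark/bin/run-example"
CMD="$RUN_EXAMPLE streaming.KafkaWordCount $ZK_QUORUM $GROUP $TOPICS $NUM_THREADS"

# Print the assembled command for review; it is not executed here.
echo "$CMD"
```

With these placeholder values, one of the four threads would be idle, so lowering the thread count to 3 would match the assumed partition count.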