Using Kafka with Spark Streaming

For information on how to configure Spark Streaming to receive data from Kafka, see the Spark Streaming + Kafka Integration Guide.

Validating Kafka Integration with Spark Streaming

To validate your Kafka integration with Spark Streaming, run the KafkaWordCount example:
/opt/cloudera/parcels/CDH/lib/spark/bin/run-example streaming.KafkaWordCount <zkQuorum> <group> <topics> <numThreads>
Replace the variables as follows:
  • <zkQuorum> - ZooKeeper quorum connection string used by Kafka (for example, zk01.example.com:2181,zk02.example.com:2181,zk03.example.com:2181).
  • <group> - Consumer group used by the application.
  • <topics> - Kafka topic (or comma-separated list of topics) containing the data for the application.
  • <numThreads> - Number of consumer threads reading the data. If this is higher than the number of partitions in the Kafka topic, some threads will be idle.
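
For example, with the ZooKeeper quorum shown above, a consumer group named spark-group, a topic named test, and one consumer thread (the group and topic names here are placeholders):
/opt/cloudera/parcels/CDH/lib/spark/bin/run-example streaming.KafkaWordCount zk01.example.com:2181,zk02.example.com:2181,zk03.example.com:2181 spark-group test 1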

Building Your Own Spark Streaming Application

To deploy your own application, follow these steps:
  1. Build an uber-JAR (a single JAR that includes the application and all of its dependencies, such as Kafka and ZooKeeper) using a Maven plugin such as Assembly or Shade; a minimal Shade configuration sketch follows these steps.

    Download the example application for reference. Kafka and ZooKeeper are specified as dependencies even though they are not called directly in the code, because the Kafka receiver needs their libraries at runtime.

  2. Build the project using mvn install and copy the uber-JAR to the cluster.
  3. To run the application, use spark-submit:
    spark-submit --master <master> --class <application_main_class> <JAR> <parameters>
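
The following is a minimal Shade plugin configuration sketch for step 1's pom.xml (the plugin version and main class shown here are placeholders; adapt them to your project):

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>2.4.3</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <transformers>
          <!-- Set the application main class in the uber-JAR manifest -->
          <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
            <mainClass>com.example.MyStreamingApp</mainClass>
          </transformer>
        </transformers>
      </configuration>
    </execution>
  </executions>
</plugin>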

See the Spark documentation for information on which master to use and how to specify it.

For example, to run the downloaded example application on a local master:
spark-submit --master local[*] --class com.shapira.examples.streamingavg.StreamingAvg uber-StreamingAvg-1.0-SNAPSHOT.jar localhost:2181/kafka group1 topic3 1
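
For reference, the core of a comparable application might look like the following Scala sketch (a minimal word-count analogue rather than the actual StreamingAvg code; the object name and batch interval are illustrative). It uses the receiver-based KafkaUtils.createStream API, which matches the ZooKeeper-based arguments shown above:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaStreamingSketch {
  def main(args: Array[String]): Unit = {
    // Same argument order as the examples above: <zkQuorum> <group> <topics> <numThreads>
    val Array(zkQuorum, group, topics, numThreads) = args

    val conf = new SparkConf().setAppName("KafkaStreamingSketch")
    val ssc = new StreamingContext(conf, Seconds(2))

    // Map each topic to the number of consumer threads reading it
    val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap

    // Receiver-based stream of (key, message) pairs; keep only the message
    val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)

    // Count words in each batch and print a sample to the driver log
    lines.flatMap(_.split(" "))
      .map((_, 1L))
      .reduceByKey(_ + _)
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}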