Spark Guide

Using Spark Streaming with Kafka on a Kerberos-Enabled Cluster

This section describes specific steps for developers using Spark Streaming with Kafka on a Kerberos-enabled cluster.

Adding the spark-streaming-kafka jar File to a Deployed Build

Before running Spark Streaming jobs with Kafka in a Kerberos environment, you will need to add or retrieve the HDP spark-streaming-kafka jar file and its associated jar files.

Note

The spark-streaming-kafka jar file is required only if your job is not part of the Spark examples package installed by HDP; jobs from that examples package do not need it.
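
If you do need the HDP-provided jar itself (for example, to pass it with --jars at submit time), one way to locate it is to search the HDP installation directory. This is only a sketch; the /usr/hdp root is an assumption based on a typical HDP layout:

    # Search the HDP installation for the spark-streaming-kafka jar
    # (the /usr/hdp root is an assumption; adjust for your environment)
    find /usr/hdp -name "spark-streaming-kafka*.jar" 2>/dev/null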

Instructions for Developing and Building Applications

If you are using Maven as your build tool:

  1. Add the Hortonworks repository to your pom.xml file:

    <repository>
        <id>hortonworks</id>
        <name>hortonworks repo</name>
        <url>http://repo.hortonworks.com/content/repositories/releases/</url>
    </repository>
  2. Add the spark-streaming-kafka and spark-streaming dependencies to your pom.xml file, specifying the Hortonworks version number:

    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming-kafka_2.10</artifactId>
        <version>1.6.2.2.4.2.0-90</version>
    </dependency>
    
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming_2.10</artifactId>
        <version>1.6.2.2.4.2.0-90</version>
        <scope>provided</scope>
    </dependency>

    Note that the correct version number combines the Spark version and the HDP version; for example, 1.6.2.2.4.2.0-90 corresponds to Spark 1.6.2 on HDP 2.4.2.0-90.

  3. (Optional) The default scope of the spark-streaming jar is "provided", which means that the jar is supplied by the runtime environment and is not packed into an uber jar. (An uber jar packages all dependencies into a single jar.) If you prefer to build an uber jar, add the maven-shade-plugin to your pom.xml file:

    <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <version>2.3</version>
        <executions>
            <execution>
                <phase>package</phase>
                <goals>
                    <goal>shade</goal>
                </goals>
            </execution>
        </executions>
        <configuration>
            <filters>
                <filter>
                    <artifact>*:*</artifact>
                    <excludes>
                        <exclude>META-INF/*.SF</exclude>
                        <exclude>META-INF/*.DSA</exclude>
                        <exclude>META-INF/*.RSA</exclude>
                    </excludes>
                </filter>
            </filters>
            <finalName>uber-${project.artifactId}-${project.version}</finalName>
        </configuration>
    </plugin>
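
    With the plugin in place, building the project produces the uber jar under the target directory. A minimal sketch of the build command (standard Maven; the output file name follows the finalName setting above):

    # Build the project; the shade plugin runs during the package phase and
    # writes target/uber-<artifactId>-<version>.jar
    mvn clean package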

Instructions for Submitting your Spark Streaming Job

Instructions for submitting your job depend on whether you used an uber jar or not:

  • If you kept the default jar scope and you can access an external network, use --packages to download the dependencies at runtime:

    spark-submit --master yarn-client --num-executors 1 \
    --packages org.apache.spark:spark-streaming-kafka_2.10:1.6.2.2.4.2.0-90 \
    --repositories http://repo.hortonworks.com/content/repositories/releases/ \
    --class <user-main-class> \
    <user-application.jar> \
    <user arg lists>

    The artifact and repository locations should be the same as specified in your pom.xml file.

  • If you packed the jar into an uber jar, submit the jar like you would a regular Spark application:

    spark-submit --master yarn-client --num-executors 1 \
    --class <user-main-class> \
    <user-uber-application.jar> \
    <user arg lists>

For a sample pom.xml file, see Sample pom.xml file for Spark Streaming with Kafka.

Running Spark Streaming - Kafka Jobs on a Kerberos-Enabled Cluster

The following instructions assume that Spark and Kafka are already deployed on a Kerberos-enabled cluster.

  1. Select or create a user account to be used as the principal. This should not be the kafka or spark service account.

  2. Generate a keytab for the user.
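
    For example, on a cluster that uses MIT Kerberos, you can create the principal and export its keytab with kadmin. This is a sketch; the principal name and keytab path match the examples below, and admin credentials are assumptions for your environment:

    # Create the principal and export its keytab (run with Kerberos admin rights)
    kadmin -q "addprinc -randkey vagrant@EXAMPLE.COM"
    kadmin -q "ktadd -k ./v.keytab vagrant@EXAMPLE.COM"
    # Verify the keytab contents
    klist -kt ./v.keytab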

  3. Create a JAAS configuration file (for example, key.conf), and add configuration settings that specify the user keytab.

    The following example specifies keytab location ./v.keytab for principal vagrant@EXAMPLE.COM.

    Note

    The keytab and configuration files will be distributed using YARN local resources. They will end up in the current directory of the Spark YARN container, thus the location should be specified as ./v.keytab.

    KafkaClient {
       com.sun.security.auth.module.Krb5LoginModule required
       useKeyTab=true
       keyTab="./v.keytab"
       storeKey=true
       useTicketCache=false
       serviceName="kafka"
       principal="vagrant@EXAMPLE.COM";
    };
  4. In your job submission command, pass the JAAS configuration file and keytab as local resource files, and add the JAAS configuration file option to the JVM options specified for the driver and executor:

    Note

    If you are running a job that is part of the Spark examples package installed by HDP, you do not need to add the spark-streaming-kafka jar. Otherwise, add the spark-streaming-kafka jar using the --jars command-line option.

    --files key.conf#key.conf,v.keytab#v.keytab 
    --driver-java-options "-Djava.security.auth.login.config=./key.conf" 
    --conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=./key.conf"

    Pass any relevant Kafka security options to your streaming application. For example, the KafkaWordCount program accepts PLAINTEXTSASL as the last option on the command line:

    KafkaWordCount /vagrant/spark-examples.jar c6402:2181 abc ts 1 PLAINTEXTSASL
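
    Putting the pieces together, a complete submission for the KafkaWordCount example might look like the sketch below. The jar path, ZooKeeper host, topic, consumer group, and security option come from the example above; the fully qualified class name is an assumption to adjust for your build. Because KafkaWordCount is part of the Spark examples package, the spark-streaming-kafka jar is not added with --jars here:

    # Sketch: submit the KafkaWordCount example on a Kerberos-enabled cluster
    spark-submit --master yarn-client --num-executors 1 \
    --files key.conf#key.conf,v.keytab#v.keytab \
    --driver-java-options "-Djava.security.auth.login.config=./key.conf" \
    --conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=./key.conf" \
    --class org.apache.spark.examples.streaming.KafkaWordCount \
    /vagrant/spark-examples.jar c6402:2181 abc ts 1 PLAINTEXTSASL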