Spark QuickStart Guide

Chapter 5. Installing Spark with Kerberos

Spark jobs are submitted to a Hadoop cluster as YARN jobs. Typically, a developer creates a Spark application in a local environment and tests it on a single-node Spark Standalone cluster on their workstation.

When a job is ready to run in a production environment, a few additional steps are required if the cluster is Kerberized:

  • The Spark History Server daemon needs a Kerberos account and keytab to run in a Kerberized cluster.

  • To submit Spark jobs in a Kerberized cluster, the account (or person) submitting jobs needs a Kerberos account and keytab.

    • When access is authenticated without human interaction -- as happens for processes that submit job requests -- the process uses a headless keytab. Mitigate the security risk by ensuring that only the service account that should use the headless keytab has permission to read it.

    • An end user should use their own keytab when submitting a Spark job.

Setting Up Principals and Keytabs for End User Access to Spark

In the following example, user $USERNAME runs the Spark Pi job in a Kerberos-enabled environment:

su $USERNAME
kinit $USERNAME@YOUR-LOCAL-REALM.COM
cd /usr/hdp/current/spark-client/
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn-cluster \
    --num-executors 3 \
    --driver-memory 512m \
    --executor-memory 512m \
    --executor-cores 1 \
    lib/spark-examples*.jar 10
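
Before running spark-submit, you can confirm with klist that kinit obtained a valid ticket; after the job completes, you can check its status in YARN. This is a minimal verification sketch, assuming the default Kerberos credential cache and a standard YARN client setup:

# Confirm that a valid ticket-granting ticket was obtained
klist

# After submission, list finished YARN applications to find the job
yarn application -list -appStates FINISHED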

Setting Up Service Principals and Keytabs for Processes Submitting Spark Jobs

The following example shows the creation and use of a headless keytab for a spark service user account that will submit Spark jobs on node blue1 (a verification sketch follows the steps):

  1. Create a Kerberos service principal for user spark:

    kadmin.local -q "addprinc -randkey spark/blue1@EXAMPLE.COM"

  2. Create the keytab:

    kadmin.local -q "xst -k /etc/security/keytabs/spark.keytab spark/blue1@EXAMPLE.COM"

  3. Create a spark user and add it to the hadoop group. (Do this on every node of your cluster.)

    useradd -g hadoop spark

  4. Make spark the owner of the newly-created keytab:

    chown spark:hadoop /etc/security/keytabs/spark.keytab

  5. Limit access: make sure user spark is the only user with read access to the keytab:

    chmod 400 /etc/security/keytabs/spark.keytab
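
After completing these steps, you can verify that the principal, keytab, and permissions are in place. This is a minimal verification sketch; kadmin.local must be run on the KDC host:

# Confirm that the service principal exists (run on the KDC host)
kadmin.local -q "listprincs spark*"

# List the principals stored in the keytab
klist -kt /etc/security/keytabs/spark.keytab

# Confirm that spark owns the keytab and is the only user who can read it
ls -l /etc/security/keytabs/spark.keytab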

In the following example, user spark runs the Spark Pi job in a Kerberos-enabled environment:

su spark
kinit -kt /etc/security/keytabs/spark.keytab spark/blue1@EXAMPLE.COM
cd /usr/hdp/current/spark-client/
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn-cluster \
    --num-executors 1 \
    --driver-memory 512m \
    --executor-memory 512m \
    --executor-cores 1 \
    lib/spark-examples*.jar 10
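
Because the headless keytab allows authentication without human interaction, an unattended process can combine kinit -kt and spark-submit in a wrapper script and run it on a schedule. The following is a minimal sketch; the script path /usr/local/bin/submit-spark-pi.sh is a hypothetical example, and the principal and paths should match your environment:

#!/bin/bash
# Hypothetical wrapper for unattended Spark job submission (run as user spark)

# Authenticate non-interactively with the headless keytab
kinit -kt /etc/security/keytabs/spark.keytab spark/blue1@EXAMPLE.COM

# Submit the job exactly as in the interactive example above
cd /usr/hdp/current/spark-client/
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn-cluster \
    --num-executors 1 \
    --driver-memory 512m \
    --executor-memory 512m \
    --executor-cores 1 \
    lib/spark-examples*.jar 10

The script could then be scheduled, for example, from the spark user's crontab:

0 2 * * * /usr/local/bin/submit-spark-pi.sh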

Accessing the Hive Metastore in Secure Mode

Requirements for accessing the Hive Metastore in secure mode (with Kerberos):

  • The Spark Thrift server must be co-located with the Hive Thrift server.

  • The spark user must be able to access the Hive keytab.

  • In yarn-client mode on a secure cluster, you can use HiveContext to access the Hive Metastore. (HiveContext is not supported in yarn-cluster mode on a secure cluster.)
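
For example, an application that uses HiveContext can be submitted in yarn-client mode as follows. This is a minimal sketch: com.example.HiveQueryApp and hive-query-app.jar are hypothetical placeholders for your own application, and the hive-site.xml path assumes a standard HDP client layout. Passing hive-site.xml with --files makes the Metastore configuration available to the executors; in client mode, the driver reads it from the local conf directory.

cd /usr/hdp/current/spark-client/
# Submit a hypothetical HiveContext-based application in yarn-client mode
./bin/spark-submit --class com.example.HiveQueryApp \
    --master yarn-client \
    --num-executors 1 \
    --driver-memory 512m \
    --executor-memory 512m \
    --executor-cores 1 \
    --files conf/hive-site.xml \
    hive-query-app.jar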