Configuring Spark for a Kerberos-Enabled Cluster
Before running Spark jobs on a Kerberos-enabled cluster, configure additional settings for the following modules and scenarios:
Spark history server
Spark Thrift server
Individuals who submit jobs
Processes that submit jobs without human interaction
Each of these scenarios is described in the following subsections.
Configuring the Spark History Server
The Spark history server daemon must have a Kerberos account and keytab to run on a Kerberos-enabled cluster.
When you enable Kerberos for a Hadoop cluster with Ambari, Ambari configures Kerberos for the history server and automatically creates a Kerberos account and keytab for it. For more information, see Enabling Kerberos Authentication Using Ambari in the HDP Security Guide.
If your cluster is not managed by Ambari, or if you plan to enable Kerberos manually for the history server, see Creating Service Principals and Keytab Files for HDP in the HDP Security Guide.
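If you are configuring the history server manually, the following is a minimal sketch of one way to do it; the principal name spark/blue1@EXAMPLE.COM and keytab path are example values reused from later in this section, not required names:

    kadmin.local -q "addprinc -randkey spark/blue1@EXAMPLE.COM"
    kadmin.local -q "xst -k /etc/security/keytabs/spark.keytab spark/blue1@EXAMPLE.COM"

Then point the history server at the principal and keytab in spark-defaults.conf:

    spark.history.kerberos.enabled true
    spark.history.kerberos.principal spark/blue1@EXAMPLE.COM
    spark.history.kerberos.keytab /etc/security/keytabs/spark.keytab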
Configuring the Spark Thrift Server
If you are installing the Spark Thrift server on a Kerberos-enabled cluster, note the following requirements (a short example follows the list):
The Spark Thrift server must run on the same host as HiveServer2, so that it can access the hiveserver2 keytab.
Permissions on /var/run/spark and /var/log/spark must grant read/write access to the Hive service account.
You must use the Hive service account to start the thriftserver process.
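A minimal sketch of these requirements, assuming the Hive service account is hive in group hadoop and Spark is installed under /usr/hdp/current/spark-client (adjust names and paths for your deployment):

    # Give the Hive service account read/write access to the Spark run and log directories
    chown -R hive:hadoop /var/run/spark /var/log/spark

    # Start the Thrift server as the Hive service account
    su hive -c "/usr/hdp/current/spark-client/sbin/start-thriftserver.sh"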
If you access Hive warehouse files through HiveServer2 on a deployment with fine-grained access control, run the Spark Thrift server as user hive. This ensures that the Spark Thrift server can access Hive keytabs, the Hive metastore, and HDFS data stored under user hive.
Important:
If you read files from HDFS directly through an interface such as Hive CLI or Spark CLI (as opposed to HiveServer2 with fine-grained access control implemented), you should use a different service account for the Spark Thrift server. Configure the account so that it can access Hive keytabs and the Hive metastore. Use of an alternate account provides a more secure configuration: when the Spark Thrift server runs queries as user hive, all data readable by user hive is exposed to any user who can submit a query, whereas an alternate account with narrower permissions limits that exposure.
For Spark jobs that are not submitted through the Thrift server, the user submitting the job must have access to the Hive metastore in secure mode, using the kinit command.
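For example, to obtain and verify a ticket before submitting a job (the principal shown is a placeholder; substitute your own):

    kinit USERNAME@YOUR-LOCAL-REALM.COM
    klist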
Setting Up Access for Submitting Jobs
Accounts that submit jobs on behalf of other processes must have a Kerberos account and keytab. End users should use their own keytabs (instead of using a headless keytab) when submitting a Spark job. The following two subsections describe both scenarios.
Setting Up Access for an Account
When access is authenticated without human interaction (as happens for processes that submit job requests), the process uses a headless keytab. Security risk is mitigated by ensuring that only the service that should be using the headless keytab has permission to read it.
The following example creates a headless keytab for a spark service user account that will submit Spark jobs on node blue1@example.com:
1. Create a Kerberos service principal for user spark:

    kadmin.local -q "addprinc -randkey spark/blue1@EXAMPLE.COM"

2. Create the keytab:

    kadmin.local -q "xst -k /etc/security/keytabs/spark.keytab spark/blue1@EXAMPLE.COM"

3. On every node of your cluster, create a spark user and add it to the hadoop group:

    useradd spark -g hadoop

4. Make spark the owner of the newly created keytab:

    chown spark:hadoop /etc/security/keytabs/spark.keytab

5. Limit access by ensuring that user spark is the only user with access to the keytab:

    chmod 400 /etc/security/keytabs/spark.keytab
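To confirm that the keytab was written correctly (an optional check, not part of the procedure above), list its entries with klist:

    klist -kt /etc/security/keytabs/spark.keytab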
In the following example, user spark runs the Spark Pi example in a Kerberos-enabled environment:

    su spark
    kinit -kt /etc/security/keytabs/spark.keytab spark/blue1@EXAMPLE.COM
    cd /usr/hdp/current/spark-client/
    ./bin/spark-submit --class org.apache.spark.examples.SparkPi \
        --master yarn-cluster \
        --num-executors 1 \
        --driver-memory 512m \
        --executor-memory 512m \
        --executor-cores 1 \
        lib/spark-examples*.jar 10
Setting Up Access for an End User
Each person who submits jobs must have a Kerberos account and their own keytab. End users should use their own keytab, rather than a headless keytab, when submitting a Spark job; this is a best practice, because a job submitted under an end-user keytab provides a higher degree of audit capability.
In the following example, end user $USERNAME has their own keytab and runs the Spark Pi job in a Kerberos-enabled environment:

    su $USERNAME
    kinit USERNAME@YOUR-LOCAL-REALM.COM
    cd /usr/hdp/current/spark-client/
    ./bin/spark-submit --class org.apache.spark.examples.SparkPi \
        --master yarn-cluster \
        --num-executors 3 \
        --driver-memory 512m \
        --executor-memory 512m \
        --executor-cores 1 \
        lib/spark-examples*.jar 10
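For long-running jobs whose runtime exceeds the Kerberos ticket lifetime, spark-submit on YARN can log in from a keytab directly and renew delegation tokens itself via the --principal and --keytab options. A sketch, with a placeholder principal and keytab path:

    ./bin/spark-submit --class org.apache.spark.examples.SparkPi \
        --master yarn-cluster \
        --principal USERNAME@YOUR-LOCAL-REALM.COM \
        --keytab /path/to/username.keytab \
        --num-executors 3 \
        --driver-memory 512m \
        --executor-memory 512m \
        --executor-cores 1 \
        lib/spark-examples*.jar 10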