Spark jobs are submitted to a Hadoop cluster as YARN jobs. The developer creates a Spark application in a local environment, and tests it in a single-node Spark Standalone cluster on their developer workstation.
When the job is ready to run in a production environment, there are a few additional steps if the cluster is Kerberized:
The Spark History Server daemon needs a Kerberos account and keytab to run in a Kerberized cluster.
To submit Spark jobs in a Kerberized cluster, the account (or person) submitting jobs needs a Kerberos account and keytab.
When you enable Kerberos for a Hadoop cluster with Ambari, Ambari sets up Kerberos for the Spark History Server and automatically creates a Kerberos account and keytab for it. For more information, see Configuring Ambari and Hadoop for Kerberos.
If you are not using Ambari, or if you plan to enable Kerberos manually for the Spark History Server, refer to "Creating Service Principals and Keytab Files for HDP" in the Setting Up Security for Manual Installs section of "Installing HDP Manually."
Here is an example showing how to create a spark principal and keytab file for node blue1 in realm EXAMPLE.COM:
Create a Kerberos service principal:
kadmin.local -q "addprinc -randkey spark/blue1@EXAMPLE.COM"
Create the keytab:
kadmin.local -q "xst -k /etc/security/keytabs/spark.keytab spark/blue1@EXAMPLE.COM"
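As an optional sanity check (not part of the original procedure), you can list the entries in the newly exported keytab to confirm the principal is present; this assumes the keytab path used above and requires the Kerberos client tools:

```shell
# List the principals and key version numbers stored in the keytab
klist -kt /etc/security/keytabs/spark.keytab
```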
Create a spark user and add it to the hadoop group. (Do this on every node of your cluster.)
useradd spark -g hadoop
Make spark the owner of the newly created keytab:
chown spark:hadoop /etc/security/keytabs/spark.keytab
Limit access: make sure user spark is the only user with access to the keytab:
chmod 400 /etc/security/keytabs/spark.keytab
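To see what mode 400 looks like in practice, the following sketch applies the same chmod to a throwaway temp file and reads the permissions back. (Inspecting the real keytab requires root; the temp file here is only illustrative.)

```shell
# Demonstrate the effect of chmod 400 on a scratch file; to check the real
# keytab, run the same stat against /etc/security/keytabs/spark.keytab.
f=$(mktemp)
chmod 400 "$f"
stat -c '%a' "$f"    # prints 400 (owner read-only, no group/other access)
rm -f "$f"
```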
The following example shows user spark running the Spark Pi example in a Kerberos-enabled environment:
su spark
kinit -kt /etc/security/keytabs/spark.keytab spark/blue1@EXAMPLE.COM
cd /usr/hdp/current/spark-client/
./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1 lib/spark-examples*.jar 10
Requirements for accessing the Hive Metastore in secure mode (with Kerberos):
The Spark thrift server must be co-located with the Hive thrift server.
The spark user must be able to access the Hive keytab.
In yarn-client mode on a secure cluster you can use HiveContext to access the Hive Metastore. (HiveContext is not supported for yarn-cluster mode on a secure cluster.)
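As an illustrative sketch (not from the original procedure), the HiveContext path in yarn-client mode can be exercised interactively from spark-shell. This assumes the HDP client layout used earlier, a valid Kerberos ticket (for example, from the kinit step shown earlier), and a reachable Hive Metastore; the SHOW TABLES query is just a placeholder:

```shell
# Hypothetical example: open an interactive Spark shell in yarn-client mode.
cd /usr/hdp/current/spark-client/
./bin/spark-shell --master yarn-client <<'EOF'
// HiveContext picks up hive-site.xml from the classpath to locate the Metastore
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.sql("SHOW TABLES").collect().foreach(println)
EOF
```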