Configuring Spark
To configure Spark, edit the following configuration files on all nodes that run Spark
jobs. These configuration files reside in the Spark client conf directory,
/usr/hdp/current/spark-client/conf, on each node:

hive-site.xml (only if you plan to use Hive with Spark)
spark-env.sh
spark-defaults.conf
spark-thrift-sparkconf.conf
Note: The following instructions are for a non-Kerberized cluster.
hive-site.xml
If you plan to use Hive with Spark, create a hive-site.xml file in the
Spark client SPARK_HOME/conf directory. (If you installed the Spark
tech preview, you can skip this step.)

Edit the hive-site.xml file so that it contains only the hive.metastore.uris
property. Make sure that the hostname points to the URI where the Hive
Metastore is running.
For example:
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://c6401.ambari.apache.org:9083</value>
  <description>URI for client to contact metastore server</description>
</property>
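Rather than editing the file by hand, the fragment can be generated from the shell. The sketch below is illustrative only: METASTORE_HOST and OUT_DIR are assumptions, and on a cluster node you would set OUT_DIR to /usr/hdp/current/spark-client/conf and point the host at your actual Hive Metastore.

```shell
# Sketch: generate a minimal hive-site.xml containing only hive.metastore.uris.
# METASTORE_HOST and OUT_DIR are assumptions -- adjust both for your cluster.
METASTORE_HOST=${METASTORE_HOST:-c6401.ambari.apache.org}
OUT_DIR=${OUT_DIR:-.}
cat > "$OUT_DIR/hive-site.xml" <<EOF
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://${METASTORE_HOST}:9083</value>
  </property>
</configuration>
EOF
```

Port 9083 is the conventional Hive Metastore Thrift port; substitute yours if it differs.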
spark-env.sh
Create a spark-env.sh file in the Spark client conf directory, and make
sure the file has the following entries:
# Location where log files are stored (default: ${SPARK_HOME}/logs)
# This can be any directory where the spark user has R/W access
export SPARK_LOG_DIR=/var/log/spark

# Location of the pid file (default: /tmp)
# This can be any directory where the spark user has R/W access
export SPARK_PID_DIR=/var/run/spark
These settings are required for starting Spark services (for example, the History
Service and the Thrift server). The user who starts Spark services needs to have read
and write permissions to the log file and PID directory. By default these files are in
the $SPARK_HOME directory, typically owned by root in RPM installations.
We recommend that you set HADOOP_CONF_DIR to the appropriate directory;
for example:

export HADOOP_CONF_DIR=/etc/hadoop/conf
This minimizes the amount of work you need to do to set up environment variables before running Spark applications.
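Taken together, a minimal spark-env.sh along these lines might look like the following. This is a sketch: the paths are the defaults suggested in this section and should be adjusted to your installation.

```shell
# Minimal spark-env.sh sketch combining the settings above.
# All three paths are the defaults from this section -- adjust as needed.
export SPARK_LOG_DIR=/var/log/spark
export SPARK_PID_DIR=/var/run/spark
export HADOOP_CONF_DIR=/etc/hadoop/conf
```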
spark-defaults.conf
Edit the spark-defaults.conf file in the Spark client conf directory.
Make sure the following values are specified, including hostname and port. For example:
spark.yarn.historyServer.address c6401.ambari.apache.org:18080
spark.history.ui.port 18080
spark.eventLog.dir hdfs:///spark-history
spark.eventLog.enabled true
spark.history.fs.logDirectory hdfs:///spark-history
Delete the spark.yarn.services property, if specified in the file.
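The presence of the required keys can be checked mechanically. The sketch below writes the example entries to a file in the current directory so it can be tried anywhere; on a cluster node, grep the real file in /usr/hdp/current/spark-client/conf instead of generating one.

```shell
# Sketch: write the example spark-defaults.conf entries locally, then verify
# that every required key is present and that spark.yarn.services is absent.
cat > spark-defaults.conf <<'EOF'
spark.yarn.historyServer.address c6401.ambari.apache.org:18080
spark.history.ui.port 18080
spark.eventLog.dir hdfs:///spark-history
spark.eventLog.enabled true
spark.history.fs.logDirectory hdfs:///spark-history
EOF
for key in spark.yarn.historyServer.address spark.history.ui.port \
           spark.eventLog.dir spark.eventLog.enabled spark.history.fs.logDirectory; do
  grep -q "^$key " spark-defaults.conf && echo "ok: $key"
done
# The spark.yarn.services property should not appear:
grep -q '^spark.yarn.services' spark-defaults.conf || echo "ok: spark.yarn.services absent"
```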
If you submit jobs programmatically in a way that spark-env.sh
is not
executed during the submit step, or if you wish to specify a different cluster version
than the version installed on the client, set the following two additional property
values:
spark.driver.extraJavaOptions -Dhdp.version=<HDP-version>
spark.yarn.am.extraJavaOptions -Dhdp.version=<HDP-version>
For example:
spark.driver.extraJavaOptions -Dhdp.version=2.6.1.0-3475
spark.yarn.am.extraJavaOptions -Dhdp.version=2.6.1.0-3475
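On an HDP node, the version string can be looked up with hdp-select (which ships with HDP) rather than typed by hand. A sketch; off-cluster, where hdp-select is unavailable, it falls back to the <HDP-version> placeholder rather than guessing:

```shell
# Sketch: look up the installed HDP version with hdp-select (present on HDP
# nodes) and emit the two property lines for spark-defaults.conf.
HDP_VERSION=$(hdp-select versions 2>/dev/null | tail -n 1)
HDP_VERSION=${HDP_VERSION:-<HDP-version>}
printf 'spark.driver.extraJavaOptions -Dhdp.version=%s\n' "$HDP_VERSION"
printf 'spark.yarn.am.extraJavaOptions -Dhdp.version=%s\n' "$HDP_VERSION"
```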
spark-thrift-sparkconf.conf
Add the following properties and values to the
spark-thrift-sparkconf.conf
file:
spark.eventLog.dir hdfs:///spark-history
spark.eventLog.enabled true
spark.history.fs.logDirectory hdfs:///spark-history
Create a spark User
To use the Spark History Service, to run Hive queries as the spark user, or to
run Spark jobs, the associated user must have sufficient HDFS access. One way of
ensuring this is to add the user to the hdfs group.
The following example creates a spark user:

1. Create the spark user on all nodes, and add it to the hdfs group:

   useradd spark

   (The useradd command is only required for tarball Spark installs, not
   RPM-based installs.)

   usermod -a -G hdfs spark

2. Create the spark user directory under /user/spark:

   sudo su $HDFS_USER
   hdfs dfs -mkdir -p /user/spark
   hdfs dfs -chown spark:spark /user/spark
   hdfs dfs -chmod -R 755 /user/spark
Create an HDFS Directory
As the hdfs service user, create an HDFS directory called /spark-history with user = spark, group = hadoop, and permissions = 777:
hdfs dfs -mkdir /spark-history
hdfs dfs -chown -R spark:hadoop /spark-history
hdfs dfs -chmod -R 777 /spark-history