Command Line Installation

Configuring Spark

To configure Spark, edit the following configuration files on all nodes that run Spark jobs. These configuration files reside in the Spark client conf directory /usr/hdp/current/spark-client/conf on each node.

  • java-opts

  • If you plan to use Hive with Spark, hive-site.xml

  • spark-env.sh

  • spark-defaults.conf

  • spark-thrift-sparkconf.conf

Note

The following instructions are for a non-Kerberized cluster.

java-opts

Create a java-opts file in the Spark client /conf directory. Add the following line to the file:

-Dhdp.version=<HDP-version>

For example:

-Dhdp.version=2.5.5.0-1245
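Because the same one-line file is needed on every node that runs Spark jobs, it can help to script its creation. The following is a minimal sketch, not part of the official procedure. CONF_DIR and HDP_VERSION are hypothetical variable names, and CONF_DIR defaults to a scratch directory so that the snippet is safe to try before pointing it at the Spark client conf directory.

```shell
# Hypothetical variables: CONF_DIR defaults to a scratch directory; on a real
# node, set it to the Spark client conf directory before running.
CONF_DIR="${CONF_DIR:-$(mktemp -d)}"
# Example version string from this section; replace it with your HDP version.
HDP_VERSION="${HDP_VERSION:-2.5.5.0-1245}"

# Write the single required line, replacing any previous content.
printf '%s\n' "-Dhdp.version=${HDP_VERSION}" > "${CONF_DIR}/java-opts"

# Show the result for a quick visual check.
cat "${CONF_DIR}/java-opts"
```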

hive-site.xml

If you plan to use Hive with Spark, create a hive-site.xml file in the Spark client SPARK_HOME/conf directory. (Note: if you installed the Spark tech preview you can skip this step.)

Edit the file so that it contains only the hive.metastore.uris property. Make sure that the value is the Thrift URI (host and port) of the node where the Hive Metastore is running.

Important

hive-site.xml contains a number of properties that are not relevant to or supported by the Spark Thrift server. Ensure that your Spark hive-site.xml file contains only the following configuration property. For example:

<property>
     <name>hive.metastore.uris</name>
     <value>thrift://c6401.ambari.apache.org:9083</value>
     <description>URI for client contact metastore server</description>
</property>
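Hadoop-style XML configuration files wrap their property elements in a single configuration root element, so the complete file is slightly larger than the fragment above. The following sketch writes such a minimal file. CONF_DIR, METASTORE_HOST, and METASTORE_PORT are hypothetical variable names; the defaults reuse the example host and port from this section, and CONF_DIR defaults to a scratch directory.

```shell
# Hypothetical variables; adjust them to your environment.
CONF_DIR="${CONF_DIR:-$(mktemp -d)}"
METASTORE_HOST="${METASTORE_HOST:-c6401.ambari.apache.org}"
METASTORE_PORT="${METASTORE_PORT:-9083}"

# Write a hive-site.xml that holds only the hive.metastore.uris property.
cat > "${CONF_DIR}/hive-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://${METASTORE_HOST}:${METASTORE_PORT}</value>
    <description>URI for client contact metastore server</description>
  </property>
</configuration>
EOF

# Count <name> elements; a count of 1 means no other property slipped in.
grep -c '<name>' "${CONF_DIR}/hive-site.xml"
```

Counting name elements is only a rough check; it does not validate the XML itself.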

spark-env.sh

Create a spark-env.sh file in the Spark client /conf directory, and make sure the file has the following entries:

# Location where log files are stored (default: ${SPARK_HOME}/logs)
# This can be any directory where the spark user has R/W access
export SPARK_LOG_DIR=/var/log/spark

# Location of the pid file (default: /tmp)
# This can be any directory where the spark user has R/W access
export SPARK_PID_DIR=/var/run/spark

These settings are required for starting Spark services (for example, the History Service and the Thrift server). The user who starts Spark services needs read and write permissions on the log and PID directories. By default these files are in the $SPARK_HOME directory, which is typically owned by root in an RPM installation.

We recommend that you set HADOOP_CONF_DIR to the appropriate directory; for example:

export HADOOP_CONF_DIR=/etc/hadoop/conf

This minimizes the amount of work you need to do to set up environment variables before running Spark applications.
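One way to confirm that the entries take effect is to source the file and list what a child process would inherit. The sketch below assumes that HADOOP_CONF_DIR is set with export inside spark-env.sh, since a variable that is not exported is not visible to processes started afterward. CONF_DIR and show_spark_env are hypothetical names, and the file is written to a scratch directory.

```shell
# Hypothetical scratch location; on a real node this is the Spark client conf directory.
CONF_DIR="${CONF_DIR:-$(mktemp -d)}"

# Write the entries described in this section.
cat > "${CONF_DIR}/spark-env.sh" <<'EOF'
export SPARK_LOG_DIR=/var/log/spark
export SPARK_PID_DIR=/var/run/spark
export HADOOP_CONF_DIR=/etc/hadoop/conf
EOF

# Source the file in a subshell so the current shell stays untouched, then
# list the three variables as an external process (env) sees them.
show_spark_env() {
  ( . "$1" && env | grep -E '^(SPARK_LOG_DIR|SPARK_PID_DIR|HADOOP_CONF_DIR)=' | sort )
}

show_spark_env "${CONF_DIR}/spark-env.sh"
```

If a variable is missing from the listing, it was either not set or not exported in the file.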

spark-defaults.conf

Edit the spark-defaults.conf file in the Spark client /conf directory.

  • Make sure the following values are specified, including hostname and port. For example:

    spark.yarn.historyServer.address c6401.ambari.apache.org:18080
    spark.history.ui.port 18080
    spark.eventLog.dir hdfs:///spark-history
    spark.eventLog.enabled true
    spark.history.fs.logDirectory hdfs:///spark-history

  • Delete the spark.yarn.services property, if specified in the file.
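The two checks above can be automated with a small script. This is a sketch under stated assumptions: check_spark_defaults is a hypothetical helper, it assumes whitespace-separated key and value pairs as shown in this section, and the sample file (including the placeholder value for spark.yarn.services) exists only to demonstrate the output.

```shell
# Hypothetical helper: report required keys that are missing and flag
# spark.yarn.services if it is still present. Returns non-zero on any finding.
check_spark_defaults() {
  conf_file="$1"
  status=0
  for key in spark.yarn.historyServer.address spark.history.ui.port \
             spark.eventLog.dir spark.eventLog.enabled \
             spark.history.fs.logDirectory; do
    # A key counts as present when it is the first field and has a value.
    if ! awk -v k="$key" '$1 == k && NF >= 2 { found = 1 } END { exit !found }' "$conf_file"; then
      echo "missing: $key"
      status=1
    fi
  done
  if awk '$1 == "spark.yarn.services" { found = 1 } END { exit !found }' "$conf_file"; then
    echo "remove: spark.yarn.services"
    status=1
  fi
  return "$status"
}

# Demonstration file: one required key is absent and spark.yarn.services is
# set to a placeholder value (not a real class name).
SAMPLE="$(mktemp)"
cat > "$SAMPLE" <<'EOF'
spark.yarn.historyServer.address c6401.ambari.apache.org:18080
spark.history.ui.port 18080
spark.eventLog.dir hdfs:///spark-history
spark.eventLog.enabled true
spark.yarn.services example.PlaceholderService
EOF

check_spark_defaults "$SAMPLE" || echo "spark-defaults.conf needs attention"
```

On a real node, pass the path of spark-defaults.conf in the Spark client conf directory instead of the sample file.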

If you submit jobs programmatically in such a way that spark-env.sh is not executed during the submit step, or if you want to specify a cluster version that differs from the version installed on the client, set the following two additional property values:

spark.driver.extraJavaOptions -Dhdp.version=<HDP-version>
spark.yarn.am.extraJavaOptions -Dhdp.version=<HDP-version>

For example:

spark.driver.extraJavaOptions -Dhdp.version=2.5.5.0-1245
spark.yarn.am.extraJavaOptions -Dhdp.version=2.5.5.0-1245

spark-thrift-sparkconf.conf

Add the following properties and values to the spark-thrift-sparkconf.conf file:

spark.eventLog.dir hdfs:///spark-history
spark.eventLog.enabled true
spark.history.fs.logDirectory hdfs:///spark-history
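If the file is managed by a script, appending the lines blindly can leave duplicate keys after repeated runs. The following sketch shows one way to set each property so that rerunning is harmless. set_prop and THRIFT_CONF are hypothetical names, the helper assumes whitespace-separated key and value pairs, and THRIFT_CONF defaults to a scratch file rather than the real spark-thrift-sparkconf.conf.

```shell
# Hypothetical helper: set "key value" in a properties file, replacing any
# existing line for the same key. Variables are global (POSIX sh has no local).
set_prop() {
  file="$1"; key="$2"; value="$3"
  [ -f "$file" ] || : > "$file"          # create the file if it does not exist
  tmp="$(mktemp)"
  # Keep every line whose first field is not the key, then append the new line.
  awk -v k="$key" '$1 != k' "$file" > "$tmp"
  printf '%s %s\n' "$key" "$value" >> "$tmp"
  cat "$tmp" > "$file"                    # overwrite in place, keeping file ownership
  rm -f "$tmp"
}

# Scratch file by default; point THRIFT_CONF at the real file on a node.
THRIFT_CONF="${THRIFT_CONF:-$(mktemp)}"

# Apply the settings twice to show that a second run adds no duplicates.
for pass in 1 2; do
  set_prop "$THRIFT_CONF" spark.eventLog.dir hdfs:///spark-history
  set_prop "$THRIFT_CONF" spark.eventLog.enabled true
  set_prop "$THRIFT_CONF" spark.history.fs.logDirectory hdfs:///spark-history
done

cat "$THRIFT_CONF"
```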

Create a spark User

To use the Spark History Service, run Hive queries as the spark user, or run Spark jobs, the associated user must have sufficient HDFS access. One way of ensuring this is to add the user to the hdfs group.

The following example creates a spark user:

  • Create the spark user on all nodes. Add it to the hdfs group.

    useradd spark

    The useradd command is required only for tarball Spark installs, not for RPM-based installs.

    usermod -a -G hdfs spark

  • Create the spark user's home directory in HDFS, /user/spark:

    sudo su $HDFS_USER

    hdfs dfs -mkdir -p /user/spark

    hdfs dfs -chown spark:spark /user/spark

    hdfs dfs -chmod -R 755 /user/spark

Create an HDFS Directory

As the hdfs service user, create an HDFS directory called spark-history, owned by user spark and group hadoop, with permissions set to 777:

hdfs dfs -mkdir /spark-history
hdfs dfs -chown -R spark:hadoop /spark-history
hdfs dfs -chmod -R 777 /spark-history
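The commands above can be wrapped in a small script with a dry-run mode, which makes it possible to review exactly what will run before touching HDFS. This is a sketch, not part of the official procedure: DRY_RUN, HISTORY_DIR, run, and setup_history_dir are hypothetical names. It also uses -mkdir -p so that a rerun does not fail when the directory already exists, and ends with -ls -d to display the resulting owner, group, and permissions.

```shell
# DRY_RUN=1 (the default here) only prints the commands. Set DRY_RUN=0 to
# execute them as the hdfs service user on a node with the HDFS client.
DRY_RUN="${DRY_RUN:-1}"
HISTORY_DIR="${HISTORY_DIR:-/spark-history}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

setup_history_dir() {
  run hdfs dfs -mkdir -p "$HISTORY_DIR"               # -p: no error if it exists
  run hdfs dfs -chown -R spark:hadoop "$HISTORY_DIR"
  run hdfs dfs -chmod -R 777 "$HISTORY_DIR"
  run hdfs dfs -ls -d "$HISTORY_DIR"                  # show owner, group, and mode
}

setup_history_dir
```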