Chapter 19. Installing and Configuring Apache Spark

This section describes how to install and configure Apache Spark for HDP.

For more information about Spark on HDP (including how to install Spark using Ambari), see the Apache Spark Quick Start Guide.

 1. Spark Prerequisites

Before installing Spark, make sure your cluster meets the following prerequisites:

 

Table 19.1. Spark Cluster Prerequisites

Item                     Prerequisite
Cluster Stack Version    HDP 2.2.4 or later
(Optional) Ambari        Version 2.0.0
Components               Spark requires HDFS and YARN


Note

If you installed the Spark tech preview, save any configuration changes you made to the tech preview environment. Install Spark 1.3.1, and then update the configuration with your changes.

 2. Installing Spark

To install Spark, run the following commands as root:

  • For RHEL or CentOS:

    yum install spark

    yum install spark-python

  • For SLES:

    zypper install spark

    zypper install spark-python

  • For Ubuntu and Debian:

    apt-get install spark

    apt-get install spark-python

When you install Spark, two directories will be created:

  • /usr/hdp/current/spark-client for submitting Spark jobs

  • /usr/hdp/current/spark-history for launching Spark master processes, such as the Spark history server

 3. Configuring Spark

To configure Spark, edit the following configuration files on all nodes that will run Spark jobs. These configuration files reside in the Spark client conf directory (/usr/hdp/current/spark-client/conf) on each node.

  • java-opts

  • If you plan to use Hive with Spark, hive-site.xml

  • spark-env.sh

  • spark-defaults.conf

Note

The following instructions are for a non-Kerberized cluster.

java-opts

Create a java-opts file in the Spark client /conf directory. Add the following line to the file.

-Dhdp.version=<HDP-version>

For example:

-Dhdp.version=2.2.8.0-3150
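As a minimal sketch, the file can be generated from the shell and reviewed before it is copied into place. The scratch directory and the variable names hdp_version and staging_dir are assumptions made for illustration; the version string is the example value shown above and should be replaced with your cluster's HDP version.

```shell
# Hypothetical helper: stage a java-opts file in a scratch directory.
hdp_version="2.2.8.0-3150"      # example value from this section
staging_dir="$(mktemp -d)"      # scratch location (an assumption)

# The file contains a single line with the JVM system property.
printf '%s\n' "-Dhdp.version=${hdp_version}" > "${staging_dir}/java-opts"

cat "${staging_dir}/java-opts"

# After review, copy the file into the Spark client conf directory as root:
#   cp "${staging_dir}/java-opts" /usr/hdp/current/spark-client/conf/java-opts
```

Staging the file first keeps the sketch safe to run on a machine where Spark is not yet installed.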

hive-site.xml

If you plan to use Hive with Spark, create a hive-site.xml file in the Spark client /conf directory. (Note: if you installed the Spark tech preview you can skip this step.)

In this file, add the hive.metastore.uris property and specify the Hive metastore as its value:

<property>
     <name>hive.metastore.uris</name>
     <value>thrift://c6401.ambari.apache.org:9083</value>
</property>
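The property above is a fragment. Like other Hadoop-style configuration files, hive-site.xml needs a <configuration> root element around its properties. The following sketch writes a complete file to a scratch directory; the variable names metastore_uri and staging_dir are assumptions for illustration, and the host is the example value from this section.

```shell
# Hypothetical helper: stage a complete hive-site.xml for review.
metastore_uri="thrift://c6401.ambari.apache.org:9083"   # example host from this section
staging_dir="$(mktemp -d)"                              # scratch location (an assumption)

# Wrap the single property in the <configuration> root element.
cat > "${staging_dir}/hive-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>${metastore_uri}</value>
  </property>
</configuration>
EOF

cat "${staging_dir}/hive-site.xml"

# After review, copy the file into the Spark client conf directory as root:
#   cp "${staging_dir}/hive-site.xml" /usr/hdp/current/spark-client/conf/hive-site.xml
```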

spark-env.sh

Create a spark-env.sh file in the Spark client /conf directory, and make sure the file has the following entries:

# Location where log files are stored (default: ${SPARK_HOME}/logs)
# This can be any directory where the spark user has R/W access
export SPARK_LOG_DIR=/var/log/spark

# Location of the pid file (default: /tmp)
# This can be any directory where the spark user has R/W access
export SPARK_PID_DIR=/var/run/spark

if [ -d "/etc/tez/conf/" ]; then
    export TEZ_CONF_DIR=/etc/tez/conf
else
    export TEZ_CONF_DIR=
fi

These settings are required for starting Spark services (for example, the History Service and the Thrift server). The user who starts Spark services must have read and write permissions on the log and PID directories. By default these files are in the $SPARK_HOME directory, which is typically owned by root in an RPM installation.
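The permission requirement can be checked before starting any service. The sketch below is one way to do it, assuming it is run as the user who will start Spark services; the function name check_rw_dir is hypothetical, and the fallback paths are the values used in the spark-env.sh entries above.

```shell
# Hypothetical helper: report whether a directory exists and is readable
# and writable by the current user.
check_rw_dir() {
    dir="$1"
    if [ -d "$dir" ] && [ -r "$dir" ] && [ -w "$dir" ]; then
        echo "ok: $dir"
    else
        echo "missing or not readable/writable: $dir"
        return 1
    fi
}

# "|| true" keeps the script going so that both directories are reported.
check_rw_dir "${SPARK_LOG_DIR:-/var/log/spark}" || true
check_rw_dir "${SPARK_PID_DIR:-/var/run/spark}" || true
```

Note that root passes the write test on almost any directory, so the result is only meaningful when the check runs as the service user.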

We recommend that you set HADOOP_CONF_DIR to the appropriate directory; for example:

export HADOOP_CONF_DIR=/etc/hadoop/conf

This will minimize the amount of work you need to do to set up environment variables before running Spark applications.
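If some users already set HADOOP_CONF_DIR in their own environment, a defaulting form avoids overriding their value. This is a sketch of one possible spark-env.sh entry, not a required setting; the fallback path is the example directory from this section.

```shell
# Keep any value the caller already exported; otherwise fall back to the
# directory used in this section.
export HADOOP_CONF_DIR="${HADOOP_CONF_DIR:-/etc/hadoop/conf}"

echo "HADOOP_CONF_DIR=${HADOOP_CONF_DIR}"
```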

spark-defaults.conf

Edit the spark-defaults.conf file in the Spark client /conf directory. Make sure the following values are specified, including hostname and port. (Note: if you installed the tech preview, these will already be in the file.) For example:

spark.yarn.historyServer.address c6401.ambari.apache.org:18080
spark.history.ui.port 18080
spark.yarn.services org.apache.spark.deploy.yarn.history.YarnHistoryService
spark.driver.extraJavaOptions -Dhdp.version=2.2.8.0-3150
spark.history.provider org.apache.spark.deploy.yarn.history.YarnHistoryProvider
spark.yarn.am.extraJavaOptions -Dhdp.version=2.2.8.0-3150
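A quick way to confirm that none of these properties is missing is to scan the file for each key. The sketch below assumes the whitespace-separated "key value" layout shown above; the function name missing_spark_keys and the variable required_keys are hypothetical.

```shell
# Property names taken from the example in this section.
required_keys="spark.yarn.historyServer.address
spark.history.ui.port
spark.yarn.services
spark.driver.extraJavaOptions
spark.history.provider
spark.yarn.am.extraJavaOptions"

# Hypothetical helper: print every required key that has no line in the file.
missing_spark_keys() {
    conf_file="$1"
    for key in $required_keys; do
        # Compare the first whitespace-separated field with the key exactly.
        awk -v k="$key" '$1 == k { found = 1 } END { exit !found }' "$conf_file" \
            || echo "$key"
    done
}

conf_file="/usr/hdp/current/spark-client/conf/spark-defaults.conf"
if [ -f "$conf_file" ]; then
    missing_spark_keys "$conf_file"
else
    echo "not found: $conf_file"
fi
```

An empty result means every key is present. The check does not validate the values themselves, such as the hostname or the HDP version string.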

Create a Spark user

To use the Spark History Service, run Hive queries as the spark user, or run Spark jobs, the associated user must have sufficient HDFS access. One way of ensuring this is to add the user to the hdfs group.

The following example creates a spark user:

  • Create the spark user on all nodes. Add it to the hdfs group.

    useradd spark

    usermod -a -G hdfs spark

  • Create the spark user's home directory in HDFS, /user/spark:

    sudo su $HDFS_USER

    hdfs dfs -mkdir -p /user/spark

    hdfs dfs -chown spark:spark /user/spark

    hdfs dfs -chmod -R 755 /user/spark
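After creating the user, the group membership can be confirmed on each node. The following sketch uses the standard id utility; the function name user_in_group is hypothetical. It checks only the local account and does not inspect HDFS.

```shell
# Hypothetical helper: succeed if the given user belongs to the given group.
user_in_group() {
    # id -nG lists every group name for the user, including the primary group.
    id -nG "$1" 2>/dev/null | tr ' ' '\n' | grep -qxF "$2"
}

if user_in_group spark hdfs; then
    echo "spark is a member of hdfs"
else
    echo "spark is missing or not in the hdfs group"
fi
```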

 4. Validating Spark

To validate the Spark installation, run the following Spark jobs:

  • Spark Pi example

  • Spark WordCount example

For detailed instructions, see the Apache Spark Quick Start Guide.

