Running Spark Applications
You can run a Spark application in three different modes:
- Standalone mode, which is the default setup.
- YARN client mode, which submits the Spark application to YARN and runs the Spark driver in the client process that submits the application.
- YARN cluster mode, which submits the Spark application to YARN, and runs the Spark driver in the ApplicationMaster in YARN.
Running SparkPi in Standalone Mode
# Prepare the classpath
$ source /etc/spark/conf/spark-env.sh
$ CLASSPATH=$CLASSPATH:/your/additional/classpath
$ $SPARK_HOME/bin/spark-class [<spark-config-options>] \
    org.apache.spark.examples.SparkPi \
    spark://$SPARK_MASTER_IP:$SPARK_MASTER_PORT 10
The available <spark-config-options> are documented in the Apache Spark Configuration guide. For example, to limit each Spark executor to 300 MB of memory, specify -Dspark.executor.memory=300M.
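For instance, here is the same SparkPi invocation with that option filled in; this is a minimal sketch that assumes the classpath setup shown above has already been run:
# Run SparkPi with executors capped at 300 MB of memory
$ $SPARK_HOME/bin/spark-class -Dspark.executor.memory=300M \
    org.apache.spark.examples.SparkPi \
    spark://$SPARK_MASTER_IP:$SPARK_MASTER_PORT 10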
Alternatively, you can prepare the classpath manually and run the example directly with java:
$ source /etc/spark/conf/spark-env.sh
$ CLASSPATH=/etc/hadoop/conf
$ CLASSPATH=$CLASSPATH:$HADOOP_HOME/*:$HADOOP_HOME/lib/*
$ CLASSPATH=$CLASSPATH:$HADOOP_HOME/../hadoop-mapreduce/*
$ CLASSPATH=$CLASSPATH:$HADOOP_HOME/../hadoop-mapreduce/lib/*
$ CLASSPATH=$CLASSPATH:$HADOOP_HOME/../hadoop-yarn/*
$ CLASSPATH=$CLASSPATH:$HADOOP_HOME/../hadoop-yarn/lib/*
$ CLASSPATH=$CLASSPATH:$HADOOP_HOME/../hadoop-hdfs/*
$ CLASSPATH=$CLASSPATH:$HADOOP_HOME/../hadoop-hdfs/lib/*
$ CLASSPATH=$CLASSPATH:$SPARK_HOME/assembly/lib/*
$ CLASSPATH=$CLASSPATH:$SPARK_HOME/examples/lib/*
$ CLASSPATH=$CLASSPATH:/your/additional/classpath
# Run the SparkPi example
$ java -cp $CLASSPATH [<spark-config-options>] \
    org.apache.spark.examples.SparkPi \
    spark://$SPARK_MASTER_IP:$SPARK_MASTER_PORT 10
You can run your own compiled Spark applications the same way:
$ java -cp $CLASSPATH [<spark-config-options>] <main-class> <args>
or:
$ $SPARK_HOME/bin/spark-class [<spark-config-options>] <main-class> <args>
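For example, a run of a hypothetical application class com.example.MyApp; the class name, the memory setting, and the input-path argument below are purely illustrative:
# Submit a hypothetical user application to the Standalone Master
$ $SPARK_HOME/bin/spark-class -Dspark.executor.memory=300M \
    com.example.MyApp \
    spark://$SPARK_MASTER_IP:$SPARK_MASTER_PORT /data/input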
Running SparkPi in YARN
First, upload the Spark assembly JAR to HDFS and point the SPARK_JAR environment variable at it:
$ source /etc/spark/conf/spark-env.sh
$ hdfs dfs -mkdir -p /user/spark/share/lib
$ hdfs dfs -put $SPARK_HOME/assembly/lib/spark-assembly_*.jar \
    /user/spark/share/lib/spark-assembly.jar
$ SPARK_JAR=hdfs://<nn>:<port>/user/spark/share/lib/spark-assembly.jar
YARN Client Mode
As in Standalone mode, you can either use the spark-class script or prepare the classpath manually. To run SparkPi using the spark-class script:
# Prepare the classpath
$ source /etc/spark/conf/spark-env.sh
$ SPARK_CLASSPATH=/your/additional/classpath
$ SPARK_JAR=hdfs://<nn>:<port>/user/spark/share/lib/spark-assembly.jar
$ $SPARK_HOME/bin/spark-class [<spark-config-options>] \
    org.apache.spark.examples.SparkPi yarn-client 10
In this case, you specify the string yarn-client as the Spark Master instead of the spark://$SPARK_MASTER_IP:$SPARK_MASTER_PORT URL used in Standalone mode; this is the key difference between the two modes.
You can also run the Spark shell in YARN client mode by setting the MASTER environment variable:
$ MASTER=yarn-client spark-shell
YARN Cluster Mode
To run SparkPi in YARN cluster mode, submit the application through org.apache.spark.deploy.yarn.Client:
$ source /etc/spark/conf/spark-env.sh
$ SPARK_JAR=hdfs://<nn>:<port>/user/spark/share/lib/spark-assembly.jar
$ APP_JAR=$SPARK_HOME/examples/lib/spark-examples_<version>.jar
$ $SPARK_HOME/bin/spark-class org.apache.spark.deploy.yarn.Client \
    --jar $APP_JAR \
    --class org.apache.spark.examples.SparkPi \
    --args yarn-standalone \
    --args 10
The org.apache.spark.deploy.yarn.Client client accepts the following options:
--jar <your_app_jar_file>
--class <app_main_class>
--args <app_main_argument, repeated for multiple arguments>
--num-workers <number_of_executors>
--master-class <application_master_class>
--master-memory <memory_for_application_master>
--worker-memory <memory_per_executor>
--worker-cores <cores_per_executor>
--name <application_name>
--queue <queue_name>
--addJars <any_local_files_used_in_SparkContext.addJar>
--files <files_for_distributed_cache>
--archives <archives_for_distributed_cache>
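For example, here is a sketch of a SparkPi submission that also sizes the YARN resources; the worker count, memory, and core values below are illustrative assumptions, not recommendations:
# Submit SparkPi in YARN cluster mode with explicit resource settings
$ $SPARK_HOME/bin/spark-class org.apache.spark.deploy.yarn.Client \
    --jar $APP_JAR \
    --class org.apache.spark.examples.SparkPi \
    --args yarn-standalone \
    --args 10 \
    --num-workers 4 \
    --worker-cores 2 \
    --worker-memory 1g \
    --master-memory 512m \
    --name SparkPi \
    --queue default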
Building Spark Applications
Best practices for building your Spark applications include:
- Building a single assembly JAR that includes all of your application's dependencies, except those for Spark and Hadoop.
- Excluding the Spark and Hadoop classes from the assembly JAR, because they are already present on the cluster and part of the runtime classpath. In Maven, you can mark the Spark and Hadoop dependencies as "provided", as in the sketch after this list.
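A minimal sketch of what the "provided" scope looks like in a pom.xml; the artifact IDs and version properties are placeholders that you would match to the Spark and Hadoop versions on your cluster:
<!-- Spark and Hadoop are on the cluster's runtime classpath, so they are
     compiled against here but not bundled into the assembly JAR. -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>${spark.version}</version>
  <scope>provided</scope>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>${hadoop.version}</version>
  <scope>provided</scope>
</dependency>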