Running Spark Applications
You can run a Spark application in three different modes:
- Standalone mode, which is the default setup.
- YARN client mode, which submits the Spark application to YARN, and runs the Spark driver in the client Spark process that submits the application.
- YARN cluster mode, which submits the Spark application to YARN, and runs the Spark driver in the ApplicationMaster in YARN.

Running SparkPi in Standalone Mode
# Prepare the classpath
$ source /etc/spark/conf/
$ CLASSPATH=$CLASSPATH:/your/additional/classpath
$ $SPARK_HOME/bin/spark-class [<spark-config-options>] \
    org.apache.spark.examples.SparkPi \
    spark://$SPARK_MASTER_IP:$SPARK_MASTER_PORT 10
The available <spark-config-options> are documented in Apache Spark Configuration. For example, to limit each Spark executor to 300 MB of memory, specify -Dspark.executor.memory=300M.
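As a rough sketch of how several properties might be passed together, the snippet below collects them in one variable and places them ahead of the main class. The variable name SPARK_OPTS, the spark.cores.max value, and the master host and port are illustrative assumptions. The command is only printed (a dry run), not executed.

```shell
# Hypothetical variable holding the <spark-config-options>; each property is a -D flag.
SPARK_OPTS="-Dspark.executor.memory=300M -Dspark.cores.max=4"

# Placeholder master address (assumed values); replace with those of your cluster.
SPARK_MASTER_IP="master.example.com"
SPARK_MASTER_PORT="7077"

# Build the command line and print it instead of launching Spark.
# \$SPARK_HOME is escaped so the printed command shows the variable name.
CMD="\$SPARK_HOME/bin/spark-class $SPARK_OPTS org.apache.spark.examples.SparkPi spark://$SPARK_MASTER_IP:$SPARK_MASTER_PORT 10"
echo "$CMD"
```

Keeping the options in one variable makes it easy to reuse the same settings for both the spark-class and the java invocations.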
# Prepare the classpath manually
$ source /etc/spark/conf/
$ CLASSPATH=/etc/hadoop/conf
$ CLASSPATH=$CLASSPATH:$HADOOP_HOME/*:$HADOOP_HOME/lib/*
$ CLASSPATH=$CLASSPATH:$HADOOP_HOME/../hadoop-mapreduce/*
$ CLASSPATH=$CLASSPATH:$HADOOP_HOME/../hadoop-mapreduce/lib/*
$ CLASSPATH=$CLASSPATH:$HADOOP_HOME/../hadoop-yarn/*
$ CLASSPATH=$CLASSPATH:$HADOOP_HOME/../hadoop-yarn/lib/*
$ CLASSPATH=$CLASSPATH:$HADOOP_HOME/../hadoop-hdfs/*
$ CLASSPATH=$CLASSPATH:$HADOOP_HOME/../hadoop-hdfs/lib/*
$ CLASSPATH=$CLASSPATH:$SPARK_HOME/assembly/lib/*
$ CLASSPATH=$CLASSPATH:$SPARK_HOME/examples/lib/*
$ CLASSPATH=$CLASSPATH:/your/additional/classpath
# Run the SparkPi example
$ java -cp $CLASSPATH [<spark-config-options>] \
    org.apache.spark.examples.SparkPi \
    spark://$SPARK_MASTER_IP:$SPARK_MASTER_PORT 10
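The manual classpath above repeats the same pattern for each Hadoop component, so it could also be assembled with a loop. This is a minimal sketch, not the documented procedure. The fallback values for HADOOP_HOME and SPARK_HOME are assumptions used only when the variables are not already set.

```shell
# Fallback locations are assumptions; values already set in the environment are kept.
HADOOP_HOME="${HADOOP_HOME:-/usr/lib/hadoop}"
SPARK_HOME="${SPARK_HOME:-/usr/lib/spark}"

# Start with the Hadoop configuration directory.
CLASSPATH="/etc/hadoop/conf"

# Add the JARs and lib/ JARs of each Hadoop component, in the same order as above.
for dir in "$HADOOP_HOME" \
           "$HADOOP_HOME/../hadoop-mapreduce" \
           "$HADOOP_HOME/../hadoop-yarn" \
           "$HADOOP_HOME/../hadoop-hdfs"; do
  CLASSPATH="$CLASSPATH:$dir/*:$dir/lib/*"
done

# Add the Spark assembly and example JARs.
CLASSPATH="$CLASSPATH:$SPARK_HOME/assembly/lib/*:$SPARK_HOME/examples/lib/*"

# Quoting keeps each * literal; the JVM expands classpath wildcards itself.
echo "$CLASSPATH"
```

When passing the result to java, quote it as well (java -cp "$CLASSPATH" ...) so that the shell does not expand the wildcards before the JVM sees them.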
You can run your own compiled Spark applications the same way:
$ java -cp $CLASSPATH [<spark-config-options>] <main-class> <args>
$ $SPARK_HOME/bin/spark-class [<spark-config-options>] <main-class> <args>
Running SparkPi in YARN

$ source /etc/spark/conf/
$ hdfs dfs -mkdir -p /user/spark/share/lib
$ hdfs dfs -put $SPARK_HOME/assembly/lib/spark-assembly_*.jar \
    /user/spark/share/lib/spark-assembly.jar
$ SPARK_JAR=hdfs://<nn>:<port>/user/spark/share/lib/spark-assembly.jar
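Before submitting to YARN, it may help to confirm that the assembly JAR is where SPARK_JAR points. The sketch below assumes hypothetical NN_HOST and NN_PORT variables standing in for <nn> and <port>; the host name is a placeholder and 8020 is only a commonly used NameNode port, so substitute your own values. The check is skipped when the hdfs command is not installed.

```shell
# Hypothetical stand-ins for <nn> and <port>; replace with your NameNode address.
NN_HOST="namenode.example.com"
NN_PORT="8020"

# Build the HDFS URI and export it so that child processes can see it.
SPARK_JAR="hdfs://$NN_HOST:$NN_PORT/user/spark/share/lib/spark-assembly.jar"
export SPARK_JAR

# Verify the upload only when the hdfs client is available on this machine.
if command -v hdfs >/dev/null 2>&1; then
  if hdfs dfs -test -e "$SPARK_JAR"; then
    echo "found $SPARK_JAR"
  else
    echo "missing $SPARK_JAR" >&2
  fi
else
  echo "hdfs command not found; skipping check of $SPARK_JAR"
fi
```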
YARN Client Mode
As in Standalone mode, you can either use the spark-class script or prepare the classpath manually. To run SparkPi using the spark-class script:
# Prepare the classpath
$ source /etc/spark/conf/
$ SPARK_CLASSPATH=/your/additional/classpath
$ SPARK_JAR=hdfs://<nn>:<port>/user/spark/share/lib/spark-assembly.jar
$ $SPARK_HOME/bin/spark-class [<spark-config-options>] \
    org.apache.spark.examples.SparkPi yarn-client 10
In this case, you are specifying the string yarn-client as the Spark Master, instead of the spark://$SPARK_MASTER_IP:$SPARK_MASTER_PORT URL as in the Standalone mode. This is the key difference between the Standalone mode and the YARN client mode.
$ MASTER=yarn-client spark-shell
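Because the master string is the only thing that changes between Standalone mode and YARN client mode, a launch script could choose it from a single mode argument. The helper name spark_master and the example host and port below are hypothetical and serve only to illustrate the idea.

```shell
# Print the Spark master string for a given run mode (hypothetical helper).
spark_master() {
  case "$1" in
    standalone)  echo "spark://$SPARK_MASTER_IP:$SPARK_MASTER_PORT" ;;
    yarn-client) echo "yarn-client" ;;
    *)           echo "unknown mode: $1" >&2; return 1 ;;
  esac
}

# Placeholder master address (assumed values) for the standalone case.
SPARK_MASTER_IP="master.example.com"
SPARK_MASTER_PORT="7077"

spark_master standalone    # prints spark://master.example.com:7077
spark_master yarn-client   # prints yarn-client
```

The output can then be substituted for the master argument of SparkPi, for example with "$(spark_master yarn-client)".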
YARN Cluster Mode
$ source /etc/spark/conf/
$ SPARK_JAR=hdfs://<nn>:<port>/user/spark/share/lib/spark-assembly.jar
$ APP_JAR=$SPARK_HOME/examples/lib/spark-examples_<version>.jar
$ $SPARK_HOME/bin/spark-class org.apache.spark.deploy.yarn.Client \
    --jar $APP_JAR \
    --class org.apache.spark.examples.SparkPi \
    --args yarn-standalone \
    --args 10
--jar <your_app_jar_file>
--class <app_main_class>
--args <app_main_arguments - given multiple times for multiple args>
--num-workers <number_of_executor_machines>
--master-class <application_master_class>
--master-memory <memory_for_application_master>
--worker-memory <memory_per_executor>
--worker-cores <cores_per_executor>
--name <application_name>
--queue <queue_name>
--addJars <any_local_files_used_in_SparkContext.addJar>
--files <files_for_distributed_cache>
--archives <archives_for_distributed_cache>
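To illustrate how the resource options listed above combine with the SparkPi submission, the sketch below assembles a cluster-mode command line and prints it without running it. The values for the number of workers, worker memory, and worker cores are assumed for illustration and are not recommendations.

```shell
# Illustrative resource settings (assumed values).
NUM_WORKERS=3
WORKER_MEMORY=2g
WORKER_CORES=1

# Assemble the command piece by piece; \$ keeps variable names literal in the output.
CMD="\$SPARK_HOME/bin/spark-class org.apache.spark.deploy.yarn.Client"
CMD="$CMD --jar \$APP_JAR --class org.apache.spark.examples.SparkPi"
CMD="$CMD --args yarn-standalone --args 10"
CMD="$CMD --num-workers $NUM_WORKERS --worker-memory $WORKER_MEMORY --worker-cores $WORKER_CORES"

# Dry run: show the command that would be submitted.
echo "$CMD"
```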
Building Spark Applications
- Building a single assembly JAR that includes all the dependencies, except those for Spark and Hadoop.
- Excluding any Spark and Hadoop classes from the assembly JAR, because they are already on the cluster, and part of the runtime classpath. In Maven, you can mark the Spark and Hadoop dependencies as "provided".
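One way to check that an assembly JAR really excludes the Spark and Hadoop classes is to scan its entry listing for their package paths. The sketch below is an assumption-laden illustration: the helper name find_bundled_classes is hypothetical, and a small sample listing stands in for the output of jar tf on a real JAR.

```shell
# Hypothetical helper: reads a JAR listing (one entry per line) on stdin and
# prints entries under the Spark or Hadoop package paths.
find_bundled_classes() {
  grep -E '^org/apache/(spark|hadoop)/'
}

# Sample listing standing in for the output of: jar tf <your_app_jar_file>
SAMPLE_LISTING="com/example/MyApp.class
org/apache/spark/SparkContext.class
org/apache/hadoop/fs/Path.class"

# grep exits non-zero when nothing matches, which selects the else branch.
if BUNDLED=$(printf '%s\n' "$SAMPLE_LISTING" | find_bundled_classes); then
  echo "Spark or Hadoop classes are bundled:"
  echo "$BUNDLED"
else
  echo "no Spark or Hadoop classes found"
fi
```

Against a real build, the listing would come from a pipeline such as jar tf <your_app_jar_file> | find_bundled_classes. Matches suggest that the Spark or Hadoop dependencies were not marked as "provided".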
