Running Apache Spark Applications

`spark-submit` command options

You specify spark-submit options using the form --option value instead of --option=value. (Use a space instead of an equals sign.)


Option	Description
`class`	For Java and Scala applications, the fully qualified classname of the class containing the main method of the application. For example, `org.apache.spark.examples.SparkPi`.
`conf`	Spark configuration property in `key`=`value` format. For values that contain spaces, surround "`key`=`value`" with quotes (as shown).
`deploy-mode`	Deployment mode: cluster and client. In cluster mode, the driver runs on worker hosts. In client mode, the driver runs locally as an external client. Use cluster mode with production jobs; client mode is more appropriate for interactive and debugging uses, where you want to see your application output immediately. To see the effect of the deployment mode when running on YARN, see Deployment Modes. Default: `client`.
`driver-class-path`	Configuration and classpath entries to pass to the driver. JARs added with --jars are automatically included in the classpath.
`driver-cores`	Number of cores used by the driver in cluster mode. Default: 1.
`driver-memory`	Maximum heap size (represented as a JVM string; for example 1024m, 2g, and so on) to allocate to the driver. Alternatively, you can use the `spark.driver.memory` property.
`files`	Comma-separated list of files to be placed in the working directory of each executor. For the client deployment mode, the path must point to a local file. For the cluster deployment mode, the path can be either a local file or a URL globally visible inside your cluster; see Advanced Dependency Management.
`jars`	Additional JARs to be loaded in the classpath of drivers and executors in cluster mode or in the executor classpath in client mode. For the client deployment mode, the path must point to a local file. For the cluster deployment mode, the path can be either a local file or a URL globally visible inside your cluster; see Advanced Dependency Management.
`master`	The location to run the application.
`packages`	Comma-separated list of Maven coordinates of JARs to include on the driver and executor classpaths. The local Maven, Maven central, and remote repositories specified in `repositories` are searched in that order. The format for the coordinates is `groupId:artifactId:version`.
`py-files`	Comma-separated list of .zip, .egg, or .py files to place on `PYTHONPATH`. For the client deployment mode, the path must point to a local file. For the cluster deployment mode, the path can be either a local file or a URL globally visible inside your cluster; see Advanced Dependency Management.
`repositories`	Comma-separated list of remote repositories to search for the Maven coordinates specified in `packages`.

Table 1. Master Values
Master	Description
local	Run Spark locally with one worker thread (that is, no parallelism).
local[`K`]	Run Spark locally with `K` worker threads. (Ideally, set this to the number of cores on your host.)
local[*]	Run Spark locally with as many worker threads as logical cores on your host.
yarn	Run using a YARN cluster manager. The cluster location is determined by `HADOOP_CONF_DIR` or `YARN_CONF_DIR`. See Configuring the Environment.

We want your opinion

How can we improve this page?

What kind of feedback do you have?