
Configuring and Running Spark (Standalone Mode)

Configuring Spark

You can change the default configuration by modifying /etc/spark/conf/spark-env.sh; you can set the following variables (see the example after the list):

  • SPARK_MASTER_IP, to bind the master to a different IP address or hostname
  • SPARK_MASTER_PORT / SPARK_MASTER_WEBUI_PORT, to use non-default ports
  • SPARK_WORKER_CORES, to set the number of cores to use on this machine
  • SPARK_WORKER_MEMORY, to set how much total memory each worker can give to executors (for example, 1000m or 2g)
  • SPARK_WORKER_PORT / SPARK_WORKER_WEBUI_PORT
  • SPARK_WORKER_INSTANCES, to set the number of worker processes per node
  • SPARK_WORKER_DIR, to set the working directory of worker processes
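
For example, a minimal spark-env.sh sketch; the host name, sizing values, and work directory below are illustrative assumptions, not required defaults:

export SPARK_MASTER_IP=spark-master.example.com   # host name or IP the master binds to (example value)
export SPARK_MASTER_PORT=7077                     # standalone master port (7077 is the default)
export SPARK_WORKER_CORES=4                       # cores each worker offers to executors
export SPARK_WORKER_MEMORY=4g                     # total memory each worker offers to executors
export SPARK_WORKER_INSTANCES=1                   # worker processes per node
export SPARK_WORKER_DIR=/var/run/spark/work       # scratch directory for workers (path is an assumption)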

Starting, Stopping, and Running Spark

  • To start Spark:
    $ sudo service spark-master start
    $ sudo service spark-worker start
    
      Note: Start the master on only one node.

  • To stop Spark:
    $ sudo service spark-worker stop
    $ sudo service spark-master stop
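
To check whether the daemons are running, the service scripts typically also support a status action:

$ sudo service spark-master status
$ sudo service spark-worker status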

Service logs are stored in /var/log/spark.
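
For example, to follow the master log as the service starts (the exact log file name depends on your host and the Spark user, so the name below is only a placeholder):

$ ls /var/log/spark
$ sudo tail -f /var/log/spark/<master_log_file>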

You can access the web UI for the Spark master at http://<master_host>:18080.

Testing the Spark Service

To test the Spark service, start spark-shell on one of the nodes. You can, for example, run a word count application:

// Read the input file from HDFS
val file = sc.textFile("hdfs://namenode:8020/path/to/input")
// Split each line into words, pair each word with a count of 1,
// and sum the counts for each word
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
// Write the results back to HDFS
counts.saveAsTextFile("hdfs://namenode:8020/output")

You can monitor the application in the Spark Master UI, by default at http://spark-master:18080, which shows the Spark shell application, its executors, and their logs.
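
Once the job completes, the results are written as part files under the output directory. A quick way to inspect them from the command line (assuming an HDFS client is configured on the node, and using the /output path from the example above):

$ hadoop fs -ls /output
$ hadoop fs -cat /output/part-00000 | head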

Running Spark Applications

For details on running Spark applications in the YARN Client and Cluster modes, see Running Spark Applications.