
Configuring and Running Spark (Standalone Mode)

Configuring Spark

You can change the default configuration by modifying /etc/spark/conf/spark-env.sh; you can set the following variables (see the example after the list):

  • SPARK_MASTER_IP, to bind the master to a different IP address or hostname
  • SPARK_MASTER_PORT / SPARK_MASTER_WEBUI_PORT, to use non-default ports
  • SPARK_WORKER_CORES, to set the number of cores to use on this machine
  • SPARK_WORKER_MEMORY, to set how much total memory each worker can give to executors (for example, 1000m or 2g)
  • SPARK_WORKER_PORT / SPARK_WORKER_WEBUI_PORT
  • SPARK_WORKER_INSTANCES, to set the number of worker processes per node
  • SPARK_WORKER_DIR, to set the working directory of worker processes
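
For example, a minimal spark-env.sh sketch; the host name, sizing values, and work directory below are illustrative assumptions, not required defaults:

export SPARK_MASTER_IP=spark-master.example.com   # host name or IP the master binds to (example value)
export SPARK_MASTER_PORT=7077                     # standalone master port (7077 is the default)
export SPARK_WORKER_CORES=4                       # cores each worker offers to executors
export SPARK_WORKER_MEMORY=4g                     # total memory each worker offers to executors
export SPARK_WORKER_INSTANCES=1                   # worker processes per node
export SPARK_WORKER_DIR=/var/run/spark/work       # scratch directory for workers (path is an assumption)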

Starting, Stopping, and Running Spark

  • To start Spark:
    $ sudo service spark-master start
    $ sudo service spark-worker start
    
      Note: Start the master on only one node.

  • To stop Spark:
    $ sudo service spark-worker stop
    $ sudo service spark-master stop
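
To check whether the daemons are running, the service scripts typically also support a status action:

$ sudo service spark-master status
$ sudo service spark-worker status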

Service logs are stored in /var/log/spark.
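
For example, to follow the master log as the service starts (the exact log file name depends on your host and the Spark user, so the name below is only a placeholder):

$ ls /var/log/spark
$ sudo tail -f /var/log/spark/<master_log_file>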

You can access the web UI for the Spark master at http://<master_host>:18080.

Testing the Spark Service

To test the Spark service, start spark-shell on one of the nodes. You can, for example, run a word count application:

// Read the input file from HDFS
val file = sc.textFile("hdfs://namenode:8020/path/to/input")
// Split each line into words, pair each word with a count of 1,
// and sum the counts for each word
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
// Write the results back to HDFS
counts.saveAsTextFile("hdfs://namenode:8020/output")

You can monitor the application in the Spark Master UI, by default at http://spark-master:18080, which shows the Spark shell application, its executors, and their logs.
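
Once the job completes, the results are written as part files under the output directory. A quick way to inspect them from the command line (assuming an HDFS client is configured on the node, and using the /output path from the example above):

$ hadoop fs -ls /output
$ hadoop fs -cat /output/part-00000 | head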

Running Spark Applications

For details on running Spark applications in the YARN Client and Cluster modes, see Running Spark Applications.