Apache Spark Component Guide

Running Spark Applications

You can use the following sample programs, Spark Pi and Spark WordCount, to validate your Spark installation and explore running Spark jobs from the command line and Spark shell.

Spark Pi

You can test your Spark installation by running the following compute-intensive example, which calculates pi by “throwing darts” at a circle. The program generates random points in the unit square ((0,0) to (1,1)) and counts how many of them fall inside the unit circle. Because that fraction approximates pi/4, multiplying it by 4 approximates pi.
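
The bundled SparkPi class implements this estimate as a Spark job. As a rough sketch of the same technique (not the exact code shipped in spark-examples), assuming a running SparkContext named sc such as the one spark-shell provides:

    // Sketch of the dart-throwing estimate; sc is assumed to be a live SparkContext.
    val samples = 100000
    val inside = sc.parallelize(1 to samples).map { _ =>
      val x = math.random                  // random point in the unit square
      val y = math.random
      if (x * x + y * y <= 1) 1 else 0     // 1 if the point lands inside the unit circle
    }.reduce(_ + _)
    println(s"Pi is roughly ${4.0 * inside / samples}")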

Follow these steps to run the Spark Pi example:

  1. Log on as a user with Hadoop Distributed File System (HDFS) access: for example, your spark user, if you defined one, or hdfs.

    When the job runs, the library is uploaded into HDFS, so the user running the job needs permission to write to HDFS.

  2. Navigate to a node with a Spark client and access the spark-client directory:

    cd /usr/hdp/current/spark-client

    su spark

  3. Run the Apache Spark Pi job in yarn-client mode, using the SparkPi example class packaged in org.apache.spark.examples:

    ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-client --num-executors 1 --driver-memory 512m --executor-memory 512m --executor-cores 1 lib/spark-examples*.jar 10

    Commonly used options include the following (a variant of the command that combines several of them is sketched after these steps):

    --class

    The entry point for your application: for example, org.apache.spark.examples.SparkPi.

    --master

    The master URL for the cluster: for example, spark://23.195.26.187:7077.

    --deploy-mode

    Whether to deploy your driver on the worker nodes (cluster) or locally as an external client (default is client).

    --conf

    Arbitrary Spark configuration property in key=value format. For values that contain spaces, enclose “key=value” in double quotation marks.

    <application-jar>

    Path to a bundled jar file that contains your application and all dependencies. The URL must be globally visible inside of your cluster: for instance, an hdfs:// path or a file:// path that is present on all nodes.

    <application-arguments>

    Arguments passed to the main method of your main class, if any.

    Your job should produce output similar to the following. Note the value of pi in the output.

    16/08/22 14:28:35 INFO scheduler.DAGScheduler: Job 0 finished: reduce at SparkPi.scala:36, took 1.721177 s
    Pi is roughly 3.141296
    16/08/22 14:28:35 INFO spark.ContextCleaner: Cleaned accumulator 1

    You can also view job status in a browser by navigating to the YARN ResourceManager Web UI and viewing job history server information. (For more information about checking job status and history, see Tuning and Troubleshooting Spark.)
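
If you want to see how the options described above combine, the following variant of the command is a sketch only: it submits the same Spark Pi job with --master yarn and --deploy-mode cluster, so that the driver runs inside the cluster, and passes an arbitrary configuration property with --conf (the spark.yarn.queue property shown is just an example):

    ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster --num-executors 1 --driver-memory 512m --executor-memory 512m --executor-cores 1 --conf "spark.yarn.queue=default" lib/spark-examples*.jar 10

In cluster mode, the “Pi is roughly” line appears in the YARN application logs rather than in your terminal; you can reach those logs through the ResourceManager Web UI.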

WordCount

WordCount is a simple program that counts how often a word occurs in a text file. The code builds a dataset of (String, Int) pairs called counts, and saves the dataset to a file.

The following example submits WordCount code to the Scala shell:

  1. Select an input file for the Spark WordCount example.

    You can use any text file as input.

  2. Log on as a user with HDFS access: for example, your spark user (if you defined one) or hdfs.

    The following example uses log4j.properties as the input file. Navigate to the Spark client directory and switch to your Spark user:

    cd /usr/hdp/current/spark-client/

    su spark

  3. Upload the input file to HDFS:

    hadoop fs -copyFromLocal /etc/hadoop/conf/log4j.properties /tmp/data

  4. Run the Spark shell:

    ./bin/spark-shell --master yarn-client --driver-memory 512m --executor-memory 512m

    You should see output similar to the following:

    16/08/22 19:33:26 INFO SecurityManager: Changing view acls to: spark
    
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /___/ .__/\_,_/_/ /_/\_\   version 1.6.2
          /_/
    
    Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60)
    Type in expressions to have them evaluated.
    Type :help for more information.
    16/08/22 19:33:30 INFO SparkContext: Running Spark version 1.6.2
    16/08/22 19:33:38 INFO EventLoggingListener: Logging events to hdfs:///spark-history/application_1459984611485_0009
    ...
    Spark context available as sc.
    ...
    16/08/22 19:33:42 INFO HiveContext: Initializing execution hive, version 1.2.1
    16/08/22 19:34:00 INFO HiveContext: Initializing HiveMetastoreConnection version 1.2.1 using Spark classes.
    16/08/22 19:34:00 INFO ClientWrapper: Inspected Hadoop version: 2.7.1.2.5.0.0-130
    16/08/22 19:34:00 INFO ClientWrapper: Loaded org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version 2.7.1.2.5.0.0-130
    16/08/22 19:34:01 INFO metastore: Trying to connect to metastore with URI thrift://green2:9083
    16/08/22 19:34:01 INFO metastore: Connected to metastore.
    SQL context available as sqlContext.
    
    scala> 
  5. At the scala> prompt, submit the job by typing the following commands, replacing the file name and file location with your own values (a standalone sketch of the same code appears after these steps):

    val file = sc.textFile("/tmp/data")
    val counts = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    counts.saveAsTextFile("/tmp/wordcount")
  6. Use one of the following approaches to view job output:

    • View the number of (word, count) pairs in the Scala shell:

      scala> counts.count()
    • View the full output from within the Scala shell:

      scala> counts.collect().foreach(println)
    • View the output using HDFS:

      1. Exit the Scala shell.

      2. List the WordCount output files:

        hadoop fs -ls /tmp/wordcount

        You should see output similar to the following:

        /tmp/wordcount/_SUCCESS
        /tmp/wordcount/part-00000
        /tmp/wordcount/part-00001
      3. Use the HDFS cat command to view WordCount output:

        hadoop fs -cat /tmp/wordcount/part-00000
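
The interactive shell session above is the quickest way to experiment. As a rough sketch only, the same logic written as a standalone application, which you would compile into a jar and launch with spark-submit as in the Spark Pi example, might look like the following. The object name and the input and output paths are illustrative, not part of the HDP distribution:

    import org.apache.spark.{SparkConf, SparkContext}

    // Illustrative standalone version of the WordCount shell session above.
    // The object name and paths are examples only; adjust them for your cluster.
    object WordCount {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("WordCount")
        val sc = new SparkContext(conf)

        val file = sc.textFile("/tmp/data")          // input file uploaded in step 3
        val counts = file
          .flatMap(line => line.split(" "))          // split each line into words
          .map(word => (word, 1))                    // pair each word with a count of 1
          .reduceByKey(_ + _)                        // sum the counts for each word
        counts.saveAsTextFile("/tmp/wordcount")      // write (word, count) pairs to HDFS

        sc.stop()
      }
    }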