Running Apache Spark ApplicationsPDF version

Running sample Spark applications

You can use the following sample Spark Pi and Spark WordCount sample programs to validate your Spark installation and explore how to run Spark jobs from the command line and Spark shell.

You can test your Spark installation by running the following compute-intensive example, which calculates pi by “throwing darts” at a circle. The program generates points in the unit square ((0,0) to (1,1)) and counts how many points fall within the unit circle within the square. The result approximates pi.

Follow these steps to run the Spark Pi example:

  1. Authenticate using kinit:
    kinit <username>
  2. Run the Apache Spark Pi job in yarn-client mode, using code from org.apache.spark:
    spark-submit --class org.apache.spark.examples.SparkPi \
        --master yarn-client \
        --num-executors 1 \
        --driver-memory 512m \
        --executor-memory 512m \
        --executor-cores 1 \
        examples/jars/spark-examples*.jar 10

    Commonly used options include the following:

    --class

    The entry point for your application: for example, org.apache.spark.examples.SparkPi.

    --master

    The master URL for the cluster: for example, spark://23.195.26.187:7077.

    --deploy-mode

    Whether to deploy your driver on the worker nodes (cluster) or locally as an external client (default is client).

    --conf

    Arbitrary Spark configuration property in key=value format. For values that contain spaces, enclose “key=value” in double quotation marks.

    <application-jar>

    Path to a bundled jar file that contains your application and all dependencies. The URL must be globally visible inside of your cluster: for instance, an hdfs:// path or a file:// path that is present on all nodes.

    <application-arguments>

    Arguments passed to the main method of your main class, if any.

    Your job should produce output similar to the following. Note the value of pi in the output.

    17/03/22 23:21:10 INFO DAGScheduler: Job 0 finished: reduce at SparkPi.scala:38, took 1.302805 s
    Pi is roughly 3.1445191445191445
    You can also view job status in a browser by navigating to the YARN ResourceManager Web UI and viewing job history server information. (For more information about checking job status and history, see "Tuning Spark" in this guide.)