Spark Guide

Spark Pi Program

To test compute-intensive tasks in Spark, the Pi example calculates pi by “throwing darts” at a circle: it generates random points in the unit square ((0,0) to (1,1)) and counts how many fall inside the quarter of the unit circle that lies within that square. Because the quarter circle covers π/4 of the square's area, multiplying the fraction of points that land inside it by four approximates pi.

A Python version of the Spark Pi program is included with the Spark distribution.
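The following is a minimal sketch of that program's sampling logic, written against the standard PySpark API; the Python example bundled with your Spark distribution (pi.py) may differ in details such as variable names and the sampling range.

    import sys
    from operator import add
    from random import random

    from pyspark import SparkContext

    if __name__ == "__main__":
        sc = SparkContext(appName="PythonPi")

        # The first command-line argument sets the number of partitions (parallel tasks).
        partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
        n = 100000 * partitions  # total number of darts to throw

        def inside(_):
            # Throw one dart at the unit square; score 1 if it lands inside the quarter circle.
            x, y = random(), random()
            return 1 if x * x + y * y <= 1.0 else 0

        count = sc.parallelize(range(1, n + 1), partitions).map(inside).reduce(add)
        print("Pi is roughly %f" % (4.0 * count / n))

        sc.stop()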

To run the Spark Pi example:

  1. Log on as a user with HDFS access; for example, your spark user, if you defined one, or hdfs. (When the job runs, the library is uploaded into HDFS, so the user running the job needs permission to write to HDFS.)
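
    To confirm write access before submitting the job (a suggested check, not part of the original procedure; spark is used here as an example user), verify that the user's HDFS home directory exists and is writable:

    # List the home directory; if it is missing, create it as the hdfs superuser
    hdfs dfs -ls /user/spark
    hdfs dfs -mkdir -p /user/spark
    hdfs dfs -chown spark:spark /user/spark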

  2. Navigate to a node with a Spark client and access the spark-client directory:

    cd /usr/hdp/current/spark-client

    su spark

  3. Run the Apache Spark Pi job in yarn-client mode, using code from org.apache.spark:

    ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-client --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1 lib/spark-examples*.jar 10

    Commonly used options include:

    • --class: The entry point for your application (e.g., org.apache.spark.examples.SparkPi)

    • --master: The master URL for the cluster (e.g., spark://23.195.26.187:7077)

    • --deploy-mode: Whether to deploy your driver on the worker nodes (cluster) or locally as an external client (client) (default: client)

    • --conf: Arbitrary Spark configuration property in key=value format. For values that contain spaces, wrap “key=value” in quotes (see the example after this list).

    • <application-jar>: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an hdfs:// path or a file:// path that is present on all nodes.

    • <application-arguments>: Arguments passed to the main method of your main class, if any.
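
    As an illustration of --conf and the cluster deploy mode, the same job could also be submitted as follows. This variant is a sketch; the configuration property shown is only an example of quoting a value that contains spaces and is not required for the Pi job:

    ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1 lib/spark-examples*.jar 10

    On later Spark versions, the same submission can be written with --master yarn --deploy-mode cluster instead of --master yarn-cluster.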

    The job should complete without errors.

    It should produce output similar to the following. Note the value of pi in the output.

    15/08/20 17:33:38 INFO YarnScheduler: Removed TaskSet 0.0, whose tasks have all completed, from pool
    15/08/20 17:33:38 INFO DAGScheduler: Job 0 finished: reduce at SparkPi.scala:35, took 10.581715 s
    Pi is roughly 3.141104
    15/08/20 17:33:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/metrics/json,null}

    To view job status in a browser, navigate to the YARN ResourceManager Web UI and view Job History Server information. (For more information about checking job status and history, see Tuning and Troubleshooting Spark.)
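
    If you prefer the command line, the standard YARN CLI offers a similar view (these commands are general YARN tools, not specific to this guide; fetching logs requires log aggregation to be enabled). Replace <application_id> with the ID reported by spark-submit or shown in the ResourceManager UI:

    # List completed YARN applications, including the Spark Pi job
    yarn application -list -appStates FINISHED

    # Retrieve the aggregated logs for one application
    yarn logs -applicationId <application_id>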