Running a Crunch Application with Spark
The blog post How-to: Run a Simple Apache Spark App in CDH 5 provides a tutorial on writing, compiling, and running a Spark application. Taking that article as a starting point, do the following to run Crunch with Spark.
- Add both the crunch-core and crunch-spark dependencies to your Maven project, along with the other dependencies shown in the blog post (a sample pom.xml excerpt follows this list).
- Use a SparkPipeline (org.apache.crunch.impl.spark.SparkPipeline) in place of the MRPipeline instance when declaring your Crunch pipeline. The SparkPipeline constructor needs either a String containing the connection string for the Spark master (local for local mode, yarn-client for YARN) or an actual JavaSparkContext instance (see the sketch after this list).
- Update the SPARK_SUBMIT_CLASSPATH environment variable so that it includes your application jar:
export SPARK_SUBMIT_CLASSPATH=./commons-codec-1.4.jar:$SPARK_HOME/assembly/lib/*:./myapp-jar-with-dependencies.jar
- Start the pipeline with the spark-submit script, passing your Crunch application's jar-with-dependencies file, just as you would for a regular Spark application (see the example command below).
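For reference, a minimal pom.xml excerpt for the two Crunch dependencies might look like the following sketch. The version property is a placeholder rather than a recommendation; use the Crunch release that matches your cluster, and keep the Spark and other dependencies from the blog post.

<!-- Placeholder version property; pick the Crunch release that matches your cluster. -->
<dependency>
  <groupId>org.apache.crunch</groupId>
  <artifactId>crunch-core</artifactId>
  <version>${crunch.version}</version>
</dependency>
<dependency>
  <groupId>org.apache.crunch</groupId>
  <artifactId>crunch-spark</artifactId>
  <version>${crunch.version}</version>
</dependency>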
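To show the SparkPipeline swap in context, here is a minimal sketch of a Crunch pipeline running on Spark in local mode. The class name, the input and output arguments, and the uppercasing transform are illustrative assumptions, not part of the blog post.

import org.apache.crunch.MapFn;
import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.spark.SparkPipeline;
import org.apache.crunch.types.writable.Writables;

public class CrunchSparkExample {
  public static void main(String[] args) throws Exception {
    // "local" runs Spark in-process; use "yarn-client" to run on YARN instead.
    Pipeline pipeline = new SparkPipeline("local", "crunch-spark-example");

    // Read the input text file and upper-case each line.
    PCollection<String> lines = pipeline.readTextFile(args[0]);
    PCollection<String> upper = lines.parallelDo(new MapFn<String, String>() {
      @Override
      public String map(String line) {
        return line.toUpperCase();
      }
    }, Writables.strings());

    // Write the result and run the pipeline.
    pipeline.writeTextFile(upper, args[1]);
    pipeline.done();
  }
}

If you already have a JavaSparkContext, you can pass it to the SparkPipeline constructor instead of the connection string.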
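A sample spark-submit invocation for the sketch above could then look like this; the main class and the input and output paths are placeholders, while the jar name matches the classpath entry shown earlier.

spark-submit --class CrunchSparkExample \
  --master local \
  ./myapp-jar-with-dependencies.jar \
  /path/to/input /path/to/output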