Running Applications with CDS Powered by Apache Spark
With CDS Powered by Apache Spark, you can run Apache Spark 2 applications locally or distributed across a cluster, either by using an interactive shell or by submitting an application. Running Spark applications interactively is commonly performed during the data-exploration phase and for ad hoc analysis.
The Spark 2 Job Commands
With Spark 2, you use slightly different command names than in Spark 1, so that you can run both versions of Spark side-by-side without conflicts:
- spark2-submit instead of spark-submit.
- spark2-shell instead of spark-shell.
- pyspark2 instead of pyspark.
For development and test purposes, you can also configure each host so that invoking the Spark 1 command name runs the corresponding Spark 2 executable. See Configuring Spark 2 Tools as the Default for details.
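As a quick illustration of the renamed commands, a Spark 2 application is submitted with spark2-submit exactly as you would use spark-submit in Spark 1. The class name, JAR path, input path, and resource settings below are placeholders for this sketch, not values shipped with CDS:

$ spark2-submit --class com.example.WordCount \
    --master yarn --deploy-mode cluster \
    --num-executors 2 --executor-memory 2g \
    /path/to/wordcount.jar hdfs:///user/jdoe/input.txt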
Canary Test for pyspark2 Command
The following example shows a simple pyspark2 session that refers to the SparkContext, calls the collect() function (which runs a Spark 2 job), and writes data to HDFS. This sequence of operations helps you check for obvious configuration issues that would prevent Spark 2 jobs from working at all. For the HDFS output directory, substitute a path that exists on your own system.
$ hdfs dfs -mkdir /user/jdoe/spark
$ pyspark2
...
SparkSession available as 'spark'.
>>> strings = ["one","two","three"]
>>> s2 = sc.parallelize(strings)
>>> s3 = s2.map(lambda word: word.upper())
>>> s3.collect()
['ONE', 'TWO', 'THREE']
>>> s3.saveAsTextFile('hdfs:///user/jdoe/spark/canary_test')
>>> quit()
$ hdfs dfs -ls /user/jdoe/spark
Found 1 items
drwxr-xr-x   - jdoe spark-users          0 2016-08-26 14:41 /user/jdoe/spark/canary_test
$ hdfs dfs -ls /user/jdoe/spark/canary_test
Found 3 items
-rw-r--r--   3 jdoe spark-users          0 2016-08-26 14:41 /user/jdoe/spark/canary_test/_SUCCESS
-rw-r--r--   3 jdoe spark-users          4 2016-08-26 14:41 /user/jdoe/spark/canary_test/part-00000
-rw-r--r--   3 jdoe spark-users         10 2016-08-26 14:41 /user/jdoe/spark/canary_test/part-00001
$ hdfs dfs -cat /user/jdoe/spark/canary_test/part-00000
ONE
$ hdfs dfs -cat /user/jdoe/spark/canary_test/part-00001
TWO
THREE
Fetching Spark 2 Maven Dependencies
The Maven coordinates are a combination of groupId, artifactId, and version. The groupId and artifactId are the same as for the upstream Apache Spark project: for example, for spark-core, the groupId is org.apache.spark and the artifactId is spark-core_2.11. The version is different for the Cloudera packaging; see Using the CDS Powered by Apache Spark Maven Repository for the exact version string for the release you are using.
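Put together, the full coordinates for spark-core in a CDS release therefore take a form such as org.apache.spark:spark-core_2.11:2.3.0.cloudera4. Treat the version shown here as an example only, and confirm the exact string for your release on the repository page.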
Adapting the Spark WordCount App for Spark 2
The following POM fragment shows how to reference a CDS Powered by Apache Spark artifact in a Maven POM.
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.11</artifactId>
  <version>2.3.0.cloudera4</version>
  <scope>provided</scope>
</dependency>
Use this dependency definition to update pom.xml for the example described in Developing and Running a Spark WordCount Application. If you are using a different CDS version, see Using the CDS Powered by Apache Spark Maven Repository. In addition, make the following changes to the application source code, because the Spark 2 Java API requires FlatMapFunction to return an Iterator rather than an Iterable:
- Add import java.util.Iterator;
- Replace all instances of Iterable with Iterator.
- Perform the following replacements (see the sketch after this list):
  - Replace return Arrays.asList(s.split(" ")); with return Arrays.asList(s.split(" ")).iterator();
  - Replace return chars; with return chars.iterator();
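To make the effect of these changes concrete, the following is a minimal, self-contained sketch of the adapted flatMap call. The class name WordSplitExample and the use of an input path from args[0] are placeholders for illustration; this is not the complete WordCount application from the referenced example.

import java.util.Arrays;
import java.util.Iterator;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;

public class WordSplitExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("WordSplitExample");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Placeholder input path; substitute a file that exists on your system.
        JavaRDD<String> lines = sc.textFile(args[0]);

        // In Spark 2, FlatMapFunction.call() returns an Iterator rather than an Iterable.
        JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
            @Override
            public Iterator<String> call(String s) {
                // Split each line into words and return an Iterator over them.
                return Arrays.asList(s.split(" ")).iterator();
            }
        });

        System.out.println("Word count: " + words.count());
        sc.stop();
    }
}

Compile this against the spark-core_2.11 dependency shown above and run it with spark2-submit.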
Accessing the Spark 2 History Server
The Spark 2 history server is available on port 18089, rather than port 18088 as with the Spark 1 history server.
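For example, if the Spark 2 history server role runs on a host named historyserver.example.com (a placeholder name), its web UI would be at http://historyserver.example.com:18089, whereas a Spark 1 history server on the same host would use http://historyserver.example.com:18088.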