Running Applications with CDS Powered by Apache Spark

With CDS Powered by Apache Spark, you can run Apache Spark 2 applications locally or distributed across a cluster, either by using an interactive shell or by submitting an application. Running Spark applications interactively is commonly performed during the data-exploration phase and for ad hoc analysis.

Important:

By default, CDH is configured to permit any user to access the Hive Metastore. However, if you have modified the value set for the configuration property hadoop.proxyuser.hive.groups, which can be modified in Cloudera Manager by setting the Hive Metastore Access Control and Proxy User Groups Override property, your Spark application might throw exceptions when it is run. To address this issue, make sure you add the groups that contain the Spark users that you want to have access to the metastore when Spark applications are run to this property in Cloudera Manager:

In the Cloudera Manager Admin Console Home page, click the Hive service.
On the Hive service page, click the Configuration tab.
In the Search well, type hadoop.proxyuser.hive.groups to locate the Hive Metastore Access Control and Proxy User Groups Override property.
Click the plus sign (+), enter the groups you want to have access to the metastore, and then click Save Changes. You must restart the Hive Metastore Server for the changes to take effect by clicking the restart icon at the top of the page.

The Spark 2 Job Commands

With Spark 2, you use slightly different command names than in Spark 1, so that you can run both versions of Spark side-by-side without conflicts:

spark2-submit instead of spark-submit.
spark2-shell instead of spark-shell.
pyspark2 instead of pyspark.

For development and test purposes, you can also configure each host so that invoking the Spark 1 command name runs the corresponding Spark 2 executable. See Configuring Spark 2 Tools as the Default for details.

Canary Test for pyspark2 Command

The following example shows a simple pyspark2 session that refers to the SparkContext, calls the collect() function which runs a Spark 2 job, and writes data to HDFS. This sequence of operations helps to check if there are obvious configuration issues that prevent Spark 2 jobs from working at all. For the HDFS path for the output directory, substitute a path that exists on your own system.

$ hdfs dfs -mkdir /user/jdoe/spark
$ pyspark2
...
SparkSession available as 'spark'.
>>> strings = ["one","two","three"]
>>> s2 = sc.parallelize(strings)
>>> s3 = s2.map(lambda word: word.upper())
>>> s3.collect()
['ONE', 'TWO', 'THREE']
>>> s3.saveAsTextFile('hdfs:///user/jdoe/spark/canary_test')
>>> quit()
$ hdfs dfs -ls /user/jdoe/spark
Found 1 items
drwxr-xr-x   - jdoe spark-users  0 2016-08-26 14:41 /user/jdoe/spark/canary_test
$ hdfs dfs -ls /user/jdoe/spark/canary_test
Found 3 items
-rw-r--r--   3 jdoe spark-users  0 2016-08-26 14:41 /user/jdoe/spark/canary_test/_SUCCESS
-rw-r--r--   3 jdoe spark-users  4 2016-08-26 14:41 /user/jdoe/spark/canary_test/part-00000
-rw-r--r--   3 jdoe spark-users 10 2016-08-26 14:41 /user/jdoe/spark/canary_test/part-00001
$ hdfs dfs -cat /user/jdoe/spark/canary_test/part-00000
ONE
$ hdfs dfs -cat /user/jdoe/spark/canary_test/part-00001
TWO
THREE

Fetching Spark 2 Maven Dependencies

The Maven coordinates are a combination of groupId, artifactId and version. The groupId and artifactId are the same as for the upstream Apache Spark project. For example, for spark-core, groupId is org.apache.spark, and artifactId is spark-core_2.11, both the same as the upstream project. The version is different for the Cloudera packaging: see Using the CDS Powered by Apache Spark Maven Repository for the exact name depending on which release you are using.

Adapting the Spark WordCount App for Spark 2

The following pom fragment shows how to access a CDS Powered by Apache Spark artifact from a Maven POM.

<dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.11</artifactId>
      <version>2.3.0.cloudera4</version>
      <scope>provided</scope>
</dependency>

Use this dependency definition to update pom.xml for the example described in Developing and Running a Spark WordCount Application. If you are using a different CDS version, see Using the CDS Powered by Apache Spark Maven Repository.

To account for changes in the Spark 2 API, before building the example, make the following updates to com.cloudera.sparkwordcount.JavaWordCount:

Add import java.util.Iterator;
Replace all instances of Iterable with Iterator.
Perform the following replacements:
- return Arrays.asList(s.split(" ")); to return Arrays.asList(s.split(" ")).iterator();
- return chars; to return chars.iterator();

Accessing the Spark 2 History Server

The Spark 2 history server is available on port 18089, rather than port 18088 as with the Spark 1 history server.

Security Considerations

Integrating Spark with Kafka