Running Apache Spark 3 applications
Learn how to run Apache Spark 3 applications locally or distributed across a cluster, either by using an interactive shell or by submitting an application.
You can run Apache Spark 3 applications locally or distributed across a cluster, either by using an interactive shell or by submitting an application. Running Spark applications interactively is commonly performed during the data-exploration phase and for ad hoc analysis.
The Spark 3 job commands
Learn how Spark 3 commands differ from those used with Spark 2.
With Spark 3, you use slightly different command names than with Spark 2, so that you can run both versions of Spark side-by-side without conflicts:
- spark3-submit instead of spark-submit.
- spark3-shell instead of spark-shell.
- pyspark3 instead of pyspark.
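For example, a minimal sketch of submitting the bundled SparkPi example application with the Spark 3 command name; the path to the examples JAR is an assumption and varies by installation:

$ spark3-submit --class org.apache.spark.examples.SparkPi \
    --master yarn --deploy-mode cluster \
    /opt/cloudera/parcels/SPARK3/lib/spark3/examples/jars/spark-examples_*.jar 10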
For development and test purposes, you can also configure each host so that invoking the Spark 2 command name runs the corresponding Spark 3 executable.
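One minimal sketch of such a setup, assuming you use shell aliases on a development host (symlinks or wrapper scripts are other options):

$ alias spark-submit=spark3-submit
$ alias spark-shell=spark3-shell
$ alias pyspark=pyspark3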
Canary test for pyspark3 command
Learn how to run a pyspark3 session.
The following example shows a simple pyspark3 session that refers to the SparkContext, calls the collect() function, which runs a Spark 3 job, and writes data to HDFS. This sequence of operations helps to check for obvious configuration issues that prevent Spark 3 jobs from working at all. For the HDFS path for the output directory, substitute a path that exists on your own system.
$ hdfs dfs -mkdir /user/jdoe/spark
$ pyspark3
...
SparkSession available as 'spark'.
>>> strings = ["one","two","three"]
>>> s2 = sc.parallelize(strings)
>>> s3 = s2.map(lambda word: word.upper())
>>> s3.collect()
['ONE', 'TWO', 'THREE']
>>> s3.saveAsTextFile('hdfs:///user/jdoe/spark/canary_test')
>>> quit()
$ hdfs dfs -ls /user/jdoe/spark
Found 1 items
drwxr-xr-x - jdoe spark-users 0 2016-08-26 14:41 /user/jdoe/spark/canary_test
$ hdfs dfs -ls /user/jdoe/spark/canary_test
Found 3 items
-rw-r--r-- 3 jdoe spark-users 0 2016-08-26 14:41 /user/jdoe/spark/canary_test/_SUCCESS
-rw-r--r-- 3 jdoe spark-users 4 2016-08-26 14:41 /user/jdoe/spark/canary_test/part-00000
-rw-r--r-- 3 jdoe spark-users 10 2016-08-26 14:41 /user/jdoe/spark/canary_test/part-00001
$ hdfs dfs -cat /user/jdoe/spark/canary_test/part-00000
ONE
$ hdfs dfs -cat /user/jdoe/spark/canary_test/part-00001
TWO
THREE
Fetching Spark 3 Maven Dependencies
Learn about the Maven dependencies for Spark 3.
The Maven coordinates are a combination of groupId, artifactId, and version. The groupId and artifactId are the same as for the upstream Apache Spark project. For example, for spark-core, the groupId is org.apache.spark and the artifactId is spark-core_2.12, both the same as the upstream project. The version is different for the Cloudera packaging: it contains the upstream Spark version as well as the CDP version, so the full dependency name takes the form [artifactId]-[UPSTREAM SPARK VERSION].[CDP VERSION]-[CDP BUILD NUMBER]. For example, spark-core_2.12-3.1.2.7.2.12.0-252.
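As an illustration, a minimal sketch of pulling this artifact into your local Maven repository with the dependency:get goal; the repository URL is an assumption and depends on where your organization hosts or mirrors the Cloudera artifacts:

$ mvn dependency:get \
    -Dartifact=org.apache.spark:spark-core_2.12:3.1.2.7.2.12.0-252 \
    -DremoteRepositories=https://repository.cloudera.com/artifactory/cloudera-repos/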
Accessing the Spark 3 History Server
Learn how to access the Spark 3 history server.
The Spark 3 history server is available on port 18089, rather than port 18088 as with the Spark 2 history server.
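As a quick check, you can query the history server's standard monitoring REST API from the command line; the host name below is a placeholder, so substitute your own history server host:

$ curl http://sparkhistoryserver.example.com:18089/api/v1/applications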