Running Apache Spark 3.5.4 applications
You can run Apache Spark 3.5.4 applications locally or distributed across a cluster, either by using an interactive shell or by submitting an application. Running Spark applications interactively is commonly performed during the data-exploration phase and for ad hoc analysis.
Apache Spark 3.5.4 ships in Cloudera Runtime 7.3.1.100 and higher, and in 7.3.2.0. For upgrade considerations, behavior changes, and JDK 17 guidance, see Migrating Spark applications.
Unsupported connectors
- SparkR
Limitations of Spark in Cloudera
- The spark.sql.orc.compression.codec configuration does not accept the zstd value.
- Push-based shuffle is not supported.
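The two limitations above can be caught before a job is submitted. The following is a minimal sketch, not a Cloudera-provided tool: the helper name find_unsupported_settings and its messages are hypothetical, and it assumes that push-based shuffle is requested through the upstream Spark property spark.shuffle.push.enabled.

```python
# Hypothetical pre-flight check; the helper name and messages are illustrative only.
ORC_CODEC_KEY = "spark.sql.orc.compression.codec"
# Assumption: upstream Spark property that turns on push-based shuffle.
PUSH_SHUFFLE_KEY = "spark.shuffle.push.enabled"


def find_unsupported_settings(conf):
    """Return warnings for settings that run into the limitations listed above.

    `conf` is a plain dict mapping Spark property names to string values.
    """
    problems = []
    codec = conf.get(ORC_CODEC_KEY, "").strip().lower()
    if codec == "zstd":
        problems.append(f"{ORC_CODEC_KEY}=zstd is not accepted")
    if conf.get(PUSH_SHUFFLE_KEY, "false").strip().lower() == "true":
        problems.append(f"{PUSH_SHUFFLE_KEY}=true requests push-based shuffle, which is not supported")
    return problems


# Example input that trips both checks.
sample = {ORC_CODEC_KEY: "zstd", PUSH_SHUFFLE_KEY: "true"}
for problem in find_unsupported_settings(sample):
    print(problem)
```

An empty list means neither limitation applies to the settings that were passed in; it does not validate any other property.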
Canary test for pyspark command
The following example shows a simple pyspark session that refers to the
SparkContext, calls the collect() function, which runs a Spark 3 job, and
writes data to HDFS. This sequence of operations helps you check whether obvious
configuration issues prevent Spark 3 jobs from working at all. For the HDFS
output directory, substitute a path that exists on your own system.
$ hdfs dfs -mkdir /user/jdoe/spark
$ pyspark
...
SparkSession available as 'spark'.
>>> strings = ["one","two","three"]
>>> s2 = sc.parallelize(strings)
>>> s3 = s2.map(lambda word: word.upper())
>>> s3.collect()
['ONE', 'TWO', 'THREE']
>>> s3.saveAsTextFile('hdfs:///user/jdoe/spark/canary_test')
>>> quit()
$ hdfs dfs -ls /user/jdoe/spark
Found 1 items
drwxr-xr-x - jdoe spark-users 0 2016-08-26 14:41 /user/jdoe/spark/canary_test
$ hdfs dfs -ls /user/jdoe/spark/canary_test
Found 3 items
-rw-r--r-- 3 jdoe spark-users 0 2016-08-26 14:41 /user/jdoe/spark/canary_test/_SUCCESS
-rw-r--r-- 3 jdoe spark-users 4 2016-08-26 14:41 /user/jdoe/spark/canary_test/part-00000
-rw-r--r-- 3 jdoe spark-users 10 2016-08-26 14:41 /user/jdoe/spark/canary_test/part-00001
$ hdfs dfs -cat /user/jdoe/spark/canary_test/part-00000
ONE
$ hdfs dfs -cat /user/jdoe/spark/canary_test/part-00001
TWO
THREE
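The same steps can be wrapped in a function so that the check is repeatable. This is a sketch under the assumption that it is pasted into a pyspark session where sc already exists; the function name run_canary is hypothetical and is not part of Spark or Cloudera Runtime.

```python
def run_canary(sc, output_path):
    """Repeat the canary steps above using an existing SparkContext `sc`.

    `output_path` must not exist yet, because saveAsTextFile refuses to
    overwrite an existing directory.
    """
    strings = ["one", "two", "three"]
    upper = sc.parallelize(strings).map(lambda word: word.upper())
    result = upper.collect()  # collect() triggers a Spark job
    if result != ["ONE", "TWO", "THREE"]:
        raise RuntimeError(f"unexpected canary result: {result!r}")
    upper.saveAsTextFile(output_path)  # writes part files plus _SUCCESS
    return result
```

In the pyspark shell, a call such as run_canary(sc, 'hdfs:///user/jdoe/spark/canary_test') would then perform the whole check, with the path replaced by one that is valid on your own system.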
Fetching Spark 3 Maven Dependencies
The Maven coordinates are a combination of groupId, artifactId, and version. The groupId
and artifactId are the same as for the upstream Apache Spark project. For example, for
spark-core, the groupId is org.apache.spark and the artifactId
is spark-core_2.12, both the same as in the upstream project. The version
differs for the Cloudera packaging: it contains the upstream Spark version as well as the
Cloudera version, in the form
[***artifactId***]-[***UPSTREAM SPARK VERSION***].[***CDP VERSION***]-[***CLOUDERA BUILD NUMBER***].
Use the Cloudera Runtime release notes or Maven
repository documentation for your deployment to obtain the exact version string for Spark
3.5.4.
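To make the naming pattern concrete, the sketch below assembles the string from its parts. The function names are hypothetical, and the build number "BUILD" is a placeholder rather than a real Cloudera build number; only the pattern itself comes from the text above.

```python
def cloudera_spark_version(upstream_version, cdp_version, build_number):
    """Join the parts as UPSTREAM SPARK VERSION.CDP VERSION-CLOUDERA BUILD NUMBER."""
    return f"{upstream_version}.{cdp_version}-{build_number}"


def artifact_label(artifact_id, upstream_version, cdp_version, build_number):
    """Prefix the Cloudera version with the artifactId, as in the pattern above."""
    version = cloudera_spark_version(upstream_version, cdp_version, build_number)
    return f"{artifact_id}-{version}"


# "BUILD" is a placeholder; look up the real build number for your release.
print(artifact_label("spark-core_2.12", "3.5.4", "7.3.1.100", "BUILD"))
```

The value returned by cloudera_spark_version is what would go in the version element of a Maven dependency, once the placeholder is replaced with the build number published for your release.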
Accessing the Spark 3 History Server
The Spark 3 history server is available on port 18089, rather than port 18088 as with the Spark 2 history server.
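As a small illustration of the port change, the following sketch builds History Server URLs. The host name history.example.com and the function name are hypothetical, and the http scheme is an assumption; a TLS-enabled cluster would use https instead. The /api/v1/applications path is the upstream Spark monitoring REST endpoint that lists applications.

```python
from urllib.parse import urlunsplit

SPARK3_HISTORY_PORT = 18089  # Spark 3 port from the text; Spark 2 used 18088


def history_server_url(host, path="/", scheme="http", port=SPARK3_HISTORY_PORT):
    """Build a URL for the Spark 3 History Server on the given host."""
    return urlunsplit((scheme, f"{host}:{port}", path, "", ""))


# Hypothetical host name; replace it with your History Server host.
print(history_server_url("history.example.com"))
print(history_server_url("history.example.com", "/api/v1/applications"))
```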
