Running Apache Spark 3 applications
You can run Apache Spark 3 applications locally or distributed across a cluster, either by using an interactive shell or by submitting an application. Running Spark applications interactively is commonly performed during the data-exploration phase and for ad hoc analysis.
Limitations of Spark in CDP
The spark.sql.orc.compression.codec config does not accept certain codec values that upstream Spark accepts.
The spark.sql.avro.compression.codec config does not accept certain codec values that upstream Spark accepts.
avroSchemaUrl is not supported in datasource options.
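These properties are set like any other Spark SQL configuration; the limitation is only that some codec values accepted by upstream Spark are rejected here. A minimal sketch in a pyspark3 session, assuming snappy is among the accepted ORC codecs and using a hypothetical HDFS output path:

>>> spark.conf.set("spark.sql.orc.compression.codec", "snappy")
>>> spark.range(3).write.orc("hdfs:///user/jdoe/spark/orc_codec_test")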
The Spark 3 job commands
For development and test purposes, you can also configure each host so that invoking the Spark 2 command name runs the corresponding Spark 3 executable.
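For example, you could map the Spark 2 command names to the Spark 3 executables with shell aliases. This is a minimal sketch for a development or test host; it assumes the Spark 3 commands are named spark3-submit, spark3-shell, and pyspark3 (only pyspark3 appears elsewhere on this page) and are on the PATH:

$ alias spark-submit=spark3-submit
$ alias spark-shell=spark3-shell
$ alias pyspark=pyspark3
$ spark-submit --version    # should now report a Spark 3 version

Aliases apply only to the current shell session; for a host-wide mapping you could instead place symbolic links to the Spark 3 executables earlier in the PATH.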
Canary test for pyspark3 command
The following example shows a simple pyspark3 session that refers to the SparkContext, calls the collect() function, which runs a Spark 3 job, and writes data to HDFS. This sequence of operations helps to check for obvious configuration issues that prevent Spark 3 jobs from working at all. For the HDFS path for the output directory, substitute a path that exists on your own system.
$ hdfs dfs -mkdir /user/jdoe/spark
$ pyspark3
...
SparkSession available as 'spark'.
>>> strings = ["one","two","three"]
>>> s2 = sc.parallelize(strings)
>>> s3 = s2.map(lambda word: word.upper())
>>> s3.collect()
['ONE', 'TWO', 'THREE']
>>> s3.saveAsTextFile('hdfs:///user/jdoe/spark/canary_test')
>>> quit()
$ hdfs dfs -ls /user/jdoe/spark
Found 1 items
drwxr-xr-x   - jdoe spark-users  0 2016-08-26 14:41 /user/jdoe/spark/canary_test
$ hdfs dfs -ls /user/jdoe/spark/canary_test
Found 3 items
-rw-r--r--   3 jdoe spark-users  0 2016-08-26 14:41 /user/jdoe/spark/canary_test/_SUCCESS
-rw-r--r--   3 jdoe spark-users  4 2016-08-26 14:41 /user/jdoe/spark/canary_test/part-00000
-rw-r--r--   3 jdoe spark-users 10 2016-08-26 14:41 /user/jdoe/spark/canary_test/part-00001
$ hdfs dfs -cat /user/jdoe/spark/canary_test/part-00000
ONE
$ hdfs dfs -cat /user/jdoe/spark/canary_test/part-00001
TWO
THREE
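You can run a similar canary for batch submission, assuming the submit command is named spark3-submit, in line with pyspark3. This is a sketch using the SparkPi example class that ships with Spark; the examples JAR path below is an assumption and varies by installation, so substitute the actual location on your system:

$ spark3-submit --master yarn --deploy-mode client \
    --class org.apache.spark.examples.SparkPi \
    /opt/cloudera/parcels/SPARK3/lib/spark3/examples/jars/spark-examples_*.jar 10

A successful run prints an approximation of pi to the console, which confirms that Spark 3 jobs can be scheduled on the cluster.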
Fetching Spark 3 Maven Dependencies
The Maven coordinates are a combination of groupId, artifactId, and version. The groupId and artifactId are the same as for the upstream Apache Spark project. For example, for spark-core, the groupId is org.apache.spark and the artifactId is spark-core_2.12, both the same as the upstream project. The version is different for the Cloudera packaging: it contains the upstream Spark version as well as the CDP version and build number, in the form [***SPARK VERSION***].[***CDP VERSION***]-[***CDP BUILD NUMBER***].
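As an illustration, you can verify that the artifact resolves from the command line with the Maven dependency plugin. This is a sketch; replace the bracketed placeholders with your distribution's actual version string, and note that the repository URL shown is an assumption to check against your environment:

$ mvn dependency:get \
    -Dartifact=org.apache.spark:spark-core_2.12:[***SPARK VERSION***].[***CDP VERSION***]-[***CDP BUILD NUMBER***] \
    -DremoteRepositories=https://repository.cloudera.com/artifactory/cloudera-repos/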
Accessing the Spark 3 History Server
The Spark 3 history server is available on port 18089, rather than port 18088 as with the Spark 2 history server.
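As a quick connectivity check, the history server exposes a REST API on the same port. This is a sketch with a hypothetical hostname; on Kerberos-secured clusters the request may also need SPNEGO authentication (for example, curl --negotiate -u :):

$ curl http://shs3.example.com:18089/api/v1/applications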