Using Spark 2 from Scala
This topic describes how to set up a Scala project for CDS 2.x Powered by Apache Spark along with a few associated tasks. Cloudera Machine Learning provides an interface to the Spark 2 shell (v 2.0+) that works with Scala 2.11.
Unlike PySpark or Sparklyr, the Scala console gives you a ready-made SparkSession and SparkContext, assigned to the spark and sc objects respectively, on console startup, just as when using the Spark shell.
By default, the application name is set to CML_sessionID, where sessionID is the ID of the session running your Spark code. To customize this, set the spark.app.name property to the desired application name in a spark-defaults.conf file.
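As a minimal sketch of that setting, a spark-defaults.conf entry could look like the following. The application name my-pi-estimator is a hypothetical value chosen for illustration; each line holds a property key and its value separated by whitespace.

```properties
# spark-defaults.conf
# Hypothetical application name; replace it with your own.
spark.app.name    my-pi-estimator
```

If the setting takes effect, sc.appName in the console should return the configured name instead of the default.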
Pi.scala is a classic starting point for calculating Pi using Monte Carlo estimation. This is the full, annotated code sample.
// Calculate pi with Monte Carlo estimation
import scala.math.random

// Make a very large unique set of 1 -> n
val partitions = 2
val n = math.min(100000L * partitions, Int.MaxValue).toInt
// Use an inclusive range so that exactly n points are sampled,
// matching the division by n in the final estimate
val xs = 1 to n

// Split up n into the number of partitions we can use
val rdd = sc.parallelize(xs, partitions).setName("'N values rdd'")

// Generate a random set of points within a 2x2 square
val sample = rdd.map { i =>
  val x = random * 2 - 1
  val y = random * 2 - 1
  (x, y)
}.setName("'Random points rdd'")

// Points within the square that are also within the center circle of r=1
val inside = sample.filter { case (x, y) => (x * x + y * y < 1) }.setName("'Random points inside circle'")
val count = inside.count()

// Area(circle)/Area(square) = inside/n => pi = 4*inside/n
println("Pi is roughly " + 4.0 * count / n)
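The estimation logic can also be tried without Spark, which may help when checking the arithmetic before distributing the work. The following is a minimal sketch in plain Scala, not part of Pi.scala: the estimatePi function, its parameters, and the fixed seed are all assumptions introduced here for illustration. It samples points uniformly in the 2x2 square and applies the same ratio, pi = 4 * inside / n.

```scala
import scala.util.Random

// Hypothetical helper (not part of Pi.scala): estimate pi by sampling points
// uniformly in the square [-1, 1) x [-1, 1) and counting the fraction that
// falls inside the unit circle.
def estimatePi(samples: Int, seed: Long): Double = {
  // A fixed seed is an assumption made here so that runs are repeatable
  val rng = new Random(seed)
  val inside = (1 to samples).count { _ =>
    val x = rng.nextDouble() * 2 - 1
    val y = rng.nextDouble() * 2 - 1
    x * x + y * y < 1
  }
  // Area(circle) / Area(square) = pi / 4, so pi is roughly 4 * inside / samples
  4.0 * inside / samples
}

val estimate = estimatePi(200000, 42L)
println("Pi is roughly " + estimate)
```

With this many samples the estimate typically agrees with pi to about two decimal places. The Spark version spreads the same sampling across partitions rather than running it in a single local loop.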
Key points to note:
- import scala.math.random
  Importing included packages works just as in the shell, and need only be done once.
- Spark context (sc)
  You can access a SparkContext assigned to the variable sc on console startup.
  val rdd = sc.parallelize(xs, partitions).setName("'N values rdd'")