Using Spark 2 from Scala

This topic describes how to set up a Scala project for CDS 2.x Powered by Apache Spark along with a few associated tasks. Cloudera Machine Learning provides an interface to the Spark 2 shell (v 2.0+) that works with Scala 2.11.

Unlike PySpark or Sparklyr, you can access a SparkContext assigned to the spark (SparkSession) and sc (SparkContext) objects on console startup, just as when using the Spark shell.

By default, the application name will be set to CML_sessionID, where sessionId is the id of the session running your Spark code. To customize this, set the property to the desired application name in a spark-defaults.conf file.

Pi.scala is a classic starting point for calculating Pi using the Montecarlo Estimation.

This is the full, annotated code sample.

//Calculate pi with Monte Carlo estimation
import scala.math.random

//make a very large unique set of 1 -> n 
val partitions = 2 
val n = math.min(100000L * partitions, Int.MaxValue).toInt 
val xs = 1 until n 

//split up n into the number of partitions we can use 
val rdd = sc.parallelize(xs, partitions).setName("'N values rdd'")

//generate a random set of points within a 2x2 square
val sample = { i =>
  val x = random * 2 - 1
  val y = random * 2 - 1
  (x, y)
}.setName("'Random points rdd'")

//points w/in the square also w/in the center circle of r=1
val inside = sample.filter { case (x, y) => (x * x + y * y < 1) }.setName("'Random points inside circle'")
val count = inside.count()
//Area(circle)/Area(square) = inside/n => pi=4*inside/n                        
println("Pi is roughly " + 4.0 * count / n)

Key points to note:

  • import scala.math.random

    Importing included packages works just as in the shell, and need only be done once.

  • Spark context (sc).
    You can access a SparkContext assigned to the variable sc on console startup.
    val rdd = sc.parallelize(xs, partitions).setName("'N values rdd'")