Using Spark 2 from R

R users can access Spark 2 using SparkR or sparklyr. Although Cloudera does not ship or support SparkR or sparklyr, we recommend using sparklyr as the R interface for Cloudera Data Science Workbench.

Installing sparklyr

You can install the latest released version of sparklyr from CRAN.

install.packages("sparklyr")

Connecting to Spark 2

You can connect to local instances of Spark 2 as well as remote clusters.

## Connecting to Spark 2
# Connect to an existing Spark 2 cluster in YARN client mode using the
# spark_connect function.
library(sparklyr)
# system.time() is optional; it reports how long the connection takes to establish.
system.time(sc <- spark_connect(master = "yarn-client"))
# The returned Spark 2 connection (sc) provides a remote dplyr data source to the Spark 2 cluster.
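
The example above covers the remote cluster case. As a minimal sketch of the local case, and of what "remote dplyr data source" means in practice, the following connects to a local Spark instance, copies a small data frame into it, and runs a dplyr query against it. It assumes that the dplyr package is installed alongside sparklyr and that a Spark distribution is available on the machine (for example, one installed with sparklyr's spark_install() function). The table name "mtcars" and the variable names are illustrative only.

```r
# Sketch: connect to a local Spark instance and query it with dplyr.
# Assumption: sparklyr and dplyr are installed, and a local Spark
# distribution is available (for example, via sparklyr::spark_install()).
library(sparklyr)
library(dplyr)

# master = "local" runs Spark on the current machine instead of on a cluster.
sc <- spark_connect(master = "local")

# Copy a built-in R data frame into Spark. The result is a remote dplyr
# table that refers to data held in Spark, not in the R session.
mtcars_tbl <- copy_to(sc, mtcars, "mtcars", overwrite = TRUE)

# dplyr verbs are translated to Spark SQL and executed by Spark.
# collect() brings the (small) result back into R as a local data frame.
avg_mpg <- mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg, na.rm = TRUE)) %>%
  arrange(cyl) %>%
  collect()

print(avg_mpg)

# Close the connection when finished to release Spark resources.
spark_disconnect(sc)
```

The same dplyr code should work unchanged against the YARN connection shown earlier, because only the master argument to spark_connect differs.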