Using Spark 2 from R
R users can access Spark 2 through sparklyr. Although Cloudera does not ship or support sparklyr, we recommend it as the R interface to Spark 2 for Cloudera Data Science Workbench.
Installing sparklyr
Install the latest version of sparklyr from GitHub as follows. The latest GitHub package includes a patch that ensures compatibility with Cloudera's Distribution of Apache Spark 2. An installation from CRAN does not include this patch.
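# install_github() comes from the devtools package; install it first if it is missing.
if (!requireNamespace("devtools", quietly = TRUE)) install.packages("devtools")
devtools::install_github("rstudio/sparklyr")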
Connecting to Spark 2
You can connect to local instances of Spark 2 as well as remote clusters.
## Connecting to Spark 2

# Connect to an existing Spark 2 cluster in YARN client mode using the
# spark_connect function.
library(sparklyr)
system.time(sc <- spark_connect(master = "yarn-client"))

# The returned Spark 2 connection (sc) provides a remote dplyr data source
# to the Spark 2 cluster.
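The example above covers a remote cluster only. As a minimal sketch of the local case, and of what using the connection as a dplyr data source can look like, the following assumes a local Spark 2 installation is available to sparklyr and that the dplyr package is installed. The table name "mtcars" and the aggregation are illustrative choices, not part of the original instructions.

```r
library(sparklyr)
library(dplyr)

# Assumption: Spark 2 is installed locally where sparklyr can find it.
# master = "local" replaces the "yarn-client" master used for a remote cluster.
sc <- spark_connect(master = "local")

# Copy a small built-in data frame into Spark. mtcars_tbl is a reference
# to the remote table, not a local copy of the data.
mtcars_tbl <- copy_to(sc, mtcars, "mtcars", overwrite = TRUE)

# dplyr verbs are translated to Spark SQL and executed in Spark.
# collect() brings the aggregated result back as a local R data frame.
result <- mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg, na.rm = TRUE)) %>%
  collect()

print(result)

# Close the connection when finished.
spark_disconnect(sc)
```

The same dplyr pipeline should work unchanged against a connection created with master = "yarn-client"; only the spark_connect call differs.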