R users can access Spark 2 using sparklyr
. Although Cloudera does not
ship or support sparklyr, we do recommend using sparklyr as the R interface for Cloudera Machine
Learning.
The spark_apply()
function requires the R Runtime environment to be
pre-installed on your cluster. This will likely require intervention from your cluster
administrator. For details, refer the RStudio documentation.- Install the latest version of sparklyr:
install.packages("sparklyr")
- Optionally, connect to a local or remote Spark 2 cluster:
## Connecting to Spark 2
# Connect to an existing Spark 2 cluster in YARN client mode using the spark_connect function.
library(sparklyr)
system.time(sc <- spark_connect(master = "yarn-client"))
# The returned Spark 2 connection (sc) provides a remote dplyr data source to the Spark 2 cluster.
For a complete example, see Importing Data into Cloudera Machine Learning.