Chapter 5. Using Spark from R: SparkR

SparkR is an R package that provides a lightweight front end to Apache Spark, supporting large-scale analytics on Hortonworks Data Platform (HDP) from the R language and environment. As of Spark 1.6.2, SparkR provides a distributed DataFrame implementation that supports operations such as selection, filtering, and aggregation on large datasets. SparkR also supports distributed machine learning through MLlib.
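
As a rough sketch of the MLlib support mentioned above, the following fits a Gaussian generalized linear model with SparkR's glm function. It assumes a running SparkR session in which sqlContext is defined, like the one started in the example later in this chapter. The choice of R's built-in iris dataset and the names iris_df, model, and predictions are illustrative only.

```r
# Assumes a running SparkR 1.6 session in which sqlContext is defined.
# createDataFrame replaces the dots in the iris column names with
# underscores, so Sepal.Length becomes Sepal_Length.
iris_df <- createDataFrame(sqlContext, iris)

# Fit a Gaussian generalized linear model on the distributed DataFrame.
model <- glm(Sepal_Length ~ Sepal_Width + Species,
             data = iris_df, family = "gaussian")

# Inspect the fitted coefficients.
summary(model)

# Score the same data; the result gains a "prediction" column.
predictions <- predict(model, newData = iris_df)
head(select(predictions, "Sepal_Length", "prediction"))
```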

Prerequisites

Before you run SparkR, ensure that your cluster meets the following prerequisites:

R must be installed on all nodes.
JAVA_HOME must be set on all nodes.

Note: SparkR is not currently supported on SLES.

SparkR Example

The following example launches SparkR, creates a DataFrame from R's built-in faithful dataset, lists the first few rows, and then reads a people DataFrame from a JSON file. (For more information about Spark DataFrames, see "Using the Spark DataFrame API.")

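Launch SparkR as the spark user. The path shown includes the HDP version and build number; adjust it to match your installation: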

su spark
cd /usr/hdp/2.5.0.0-3485/spark/bin
./sparkR

Output similar to the following displays:

Welcome to
    ____              __ 
   / __/__  ___ _____/ /__ 
  _\ \/ _ \/ _ `/ __/  '_/ 
 /___/ .__/\_,_/_/ /_/\_\   version  1.6.2
    /_/ 

Spark context is available as sc, SQL context is available as sqlContext
>

From the R prompt that SparkR opens (not the Spark shell), initialize the SQLContext, create a DataFrame from the faithful dataset, and list the first few rows:

sqlContext <- sparkRSQL.init(sc)
df <- createDataFrame(sqlContext, faithful)
head(df)

Output similar to the following displays:

...
 eruptions waiting
1     3.600      79
2     1.800      54
3     3.333      74
4     2.283      62
5     4.533      85
6     2.883      55
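
The selection, filtering, and aggregation operations mentioned in the introduction can be sketched against the same DataFrame. This is a minimal illustration that assumes the session above, where df was created from the faithful dataset; the names short_waits and waiting_counts and the threshold of 50 are arbitrary choices for the example.

```r
# Assumes the SparkR session above, with df created by
# createDataFrame(sqlContext, faithful).

# Selection: keep only the eruptions column.
head(select(df, df$eruptions))

# Filtering: keep rows with a waiting time under 50 minutes.
short_waits <- filter(df, df$waiting < 50)
head(short_waits)

# Aggregation: count the rows for each distinct waiting time.
waiting_counts <- summarize(groupBy(df, df$waiting), count = n(df$waiting))
head(waiting_counts)
```

Each call returns a new distributed DataFrame; head brings only the first few rows back into the local R session.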

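Read a people DataFrame from a JSON file and list its first few rows. The people.json file must be at a location that Spark can read: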

people <- read.df(sqlContext, "people.json", "json")
head(people)

Output similar to the following displays:

 age    name
1  NA Michael
2  30    Andy
3  19  Justin
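
Once the people DataFrame is loaded, it can be inspected and queried further. The following sketch assumes the session above and uses a hypothetical temporary table name, people, along with the illustrative variable names adults and teenagers.

```r
# Assumes the SparkR session above, with people loaded by read.df.

# Show the column names and types inferred from the JSON file.
printSchema(people)

# Filter with the DataFrame API: rows with an age over 21.
adults <- filter(people, people$age > 21)
head(adults)

# Register the DataFrame as a temporary table and query it with SQL.
# The table name "people" is an arbitrary choice for this example.
registerTempTable(people, "people")
teenagers <- sql(sqlContext,
                 "SELECT name FROM people WHERE age >= 13 AND age <= 19")
head(teenagers)
```

With the three rows shown above, the age filter matches only Andy and the SQL query matches only Justin; the row with a missing age satisfies neither condition.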

Additional Resources

For additional SparkR examples, see the Apache SparkR documentation.