Chapter 9. Using Spark from R: SparkR
SparkR is an R package that provides a lightweight front end for using Apache Spark from R, supporting large-scale analytics on Hortonworks Data Platform (HDP) from the R language and environment.
SparkR provides a distributed data frame implementation that supports operations like selection, filtering, and aggregation on large datasets. In addition, SparkR supports distributed machine learning through MLlib.
This chapter lists prerequisites, followed by a SparkR example. Here are several links to additional information:
For information about SparkR architecture and the use of SparkR in a data science workflow, see Integrate SparkR and R for Better Data Science Workflow.
For information about how to install and use R packages with SparkR, see Using R Packages with SparkR.
For additional SparkR information, see the Apache SparkR documentation for your version of Apache Spark. The examples in this chapter use Spark 1.6.3.
Prerequisites
Before you run SparkR, ensure that your cluster meets the following prerequisites:
R must be installed on all nodes. Commands for installing R are specific to the operating system. For example, for CentOS you would log on as root and run the following command:
yum install R
JAVA_HOME must be set on all nodes.
Note: SparkR is not currently supported on SLES.
SparkR Example
The following example launches SparkR, and then uses R to create a DataFrame in Spark 1.6, list the first few rows of the DataFrame, and read a people DataFrame from a JSON file.
(For more information about Spark DataFrames, see "Using the Spark DataFrame API").
Launch SparkR:
su spark
cd /usr/hdp/2.6.0.0-598/spark/bin
./sparkR
Output similar to the following displays:
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.3
      /_/

Spark context is available as sc, SQL context is available as sqlContext
>
From your R prompt (not the Spark shell), initialize SQLContext, create a DataFrame, and list the first few rows:
sqlContext <- sparkRSQL.init(sc)
df <- createDataFrame(sqlContext, faithful)
head(df)
You should see results similar to the following:
...
  eruptions waiting
1     3.600      79
2     1.800      54
3     3.333      74
4     2.283      62
5     4.533      85
6     2.883      55
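SparkR DataFrames also support the grouping and aggregation operations mentioned at the start of this chapter. As a brief illustration (a sketch using the Spark 1.6 SparkR API, run at the same SparkR prompt), the following groups the faithful data by waiting time and counts the rows in each group:

```r
# Group the DataFrame by waiting time and count the rows in each group
waiting_counts <- summarize(groupBy(df, df$waiting), count = n(df$waiting))
head(waiting_counts)
```

head() returns only the first few rows to the R driver; use collect() if you want the full result as a local R data frame.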
Read the people DataFrame:
people <- read.df(sqlContext, "people.json", "json")
head(people)
You should see results similar to the following:
  age    name
1  NA Michael
2  30    Andy
3  19  Justin
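To illustrate selection and filtering on this DataFrame (a sketch using the Spark 1.6 SparkR API, continuing from the people DataFrame above):

```r
# Keep only rows where age is greater than 20, then project the name column
adults <- filter(people, people$age > 20)
head(select(adults, adults$name))
```

Both filter() and select() are evaluated lazily; Spark runs the job only when an action such as head() or collect() is called.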