Chapter 9. Using Spark from R: SparkR
SparkR is an R package that provides a lightweight front end for using Apache Spark from R, supporting large-scale analytics on Hortonworks Data Platform (HDP) from the R language and environment.
SparkR provides a distributed data frame implementation that supports operations like selection, filtering, and aggregation on large datasets. In addition, SparkR supports distributed machine learning through MLlib.
This chapter lists prerequisites, followed by a SparkR example. Here are several links to additional information:
For information about SparkR architecture and the use of SparkR in a data science workflow, see Integrate SparkR and R for Better Data Science Workflow.
For information about how to install and use R packages with SparkR, see Using R Packages with SparkR.
For additional SparkR information, see the Apache SparkR documentation for your version of Apache Spark (the linked documentation is for Spark version 1.6.3).
Prerequisites
Before you run SparkR, ensure that your cluster meets the following prerequisites:
R must be installed on all nodes. Commands for installing R are specific to the operating system. For example, for CentOS you would log on as root and run the following command:
yum install R
JAVA_HOME must be set on all nodes.
SparkR Example
The following example launches SparkR and then uses R to create a DataFrame in Spark 1.6, list the first few rows, and read a people DataFrame from a JSON file.
(For more information about Spark DataFrames, see "Using the Spark DataFrame API").
Launch SparkR:
su spark
cd /usr/hdp/2.6.0.0-598/spark/bin
./sparkR
Output similar to the following displays:
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.3
      /_/

Spark context is available as sc, SQL context is available as sqlContext
>
From your R prompt (not the Spark shell), initialize SQLContext, create a DataFrame, and list the first few rows:
sqlContext <- sparkRSQL.init(sc)
df <- createDataFrame(sqlContext, faithful)
head(df)
You should see results similar to the following:
...
  eruptions waiting
1     3.600      79
2     1.800      54
3     3.333      74
4     2.283      62
5     4.533      85
6     2.883      55
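Optionally, you can continue in the same SparkR session to try the DataFrame operations mentioned at the beginning of this chapter. The following is a minimal sketch, assuming the df DataFrame created above and the Spark 1.6 SparkR API:

# Select a single column and view the first rows
head(select(df, df$eruptions))
# Filter rows where the waiting time is less than 50 minutes
head(filter(df, df$waiting < 50))
# Group by waiting time and count the rows in each group
head(summarize(groupBy(df, df$waiting), count = n(df$waiting)))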
Read the people DataFrame:

people <- read.df(sqlContext, "people.json", "json")
head(people)
You should see results similar to the following:
  age    name
1  NA Michael
2  30    Andy
3  19  Justin
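As a follow-on, the sketch below shows two more things you can do in the same session: run a SQL query against the people DataFrame by registering it as a temporary table, and fit a generalized linear model with MLlib on the df DataFrame created earlier. The teenagers variable name and the query are illustrative, and the code assumes the Spark 1.6 SparkR API:

# Register the DataFrame as a temporary table and query it with SQL
registerTempTable(people, "people")
teenagers <- sql(sqlContext, "SELECT name FROM people WHERE age >= 13 AND age <= 19")
head(teenagers)

# Fit a Gaussian generalized linear model with MLlib on the faithful DataFrame
model <- glm(waiting ~ eruptions, data = df, family = "gaussian")
summary(model)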