Chapter 5. Using Spark from R: SparkR
SparkR is an R package that provides a lightweight front end for using Apache Spark from R, thus supporting large-scale analytics on Hortonworks Data Platform (HDP) from the R language and environment. As of Spark 1.6.2, SparkR provides a distributed data frame implementation that supports operations like selection, filtering, and aggregation on large datasets. In addition, SparkR supports distributed machine learning through MLlib.
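As a rough illustration of the MLlib support mentioned above, the following sketch fits a Gaussian generalized linear model with SparkR's `glm`. It assumes a SparkR 1.6 shell is already running (as in the SparkR Example later in this chapter), so that `sqlContext` is defined. The choice of R's built-in `faithful` dataset and of the model formula is an assumption made for illustration only.

```r
# Assumes a running SparkR 1.6 shell, where `sqlContext` is already defined.
# `faithful` is R's built-in Old Faithful dataset (columns: eruptions, waiting).
df <- createDataFrame(sqlContext, faithful)

# Fit a Gaussian GLM (ordinary linear regression) of waiting time on
# eruption duration. The formula is illustrative only.
model <- glm(waiting ~ eruptions, data = df, family = "gaussian")

# Inspect the fitted coefficients.
summary(model)

# Score the training data; predicted values are returned in a
# "prediction" column alongside the original columns.
predictions <- predict(model, newData = df)
head(select(predictions, "waiting", "prediction"))
```

The model is fitted by Spark on the distributed DataFrame, so the same calls should apply to data that is too large to hold in a local R session.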
Prerequisites
Before you run SparkR, ensure that your cluster meets the following prerequisites:
R must be installed on all nodes.
JAVA_HOME must be set on all nodes.
Note: SparkR is not currently supported on SLES.
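The prerequisites above can be spot-checked from R itself. The following is a minimal sketch using only base R; the helper name `check_sparkr_prereqs` is hypothetical, and the check covers only the node it runs on, so it would need to be repeated on every node.

```r
# Hypothetical helper: report whether the local node appears to satisfy
# the SparkR prerequisites. It inspects only the node it runs on.
check_sparkr_prereqs <- function() {
  java_home <- Sys.getenv("JAVA_HOME", unset = "")
  list(
    # If this code runs at all, R is installed on this node.
    r_version        = R.version.string,
    java_home        = java_home,
    # TRUE when JAVA_HOME is set to a non-empty value.
    java_home_set    = nzchar(java_home),
    # TRUE when JAVA_HOME is set and the path it names exists.
    java_home_exists = nzchar(java_home) && file.exists(java_home)
  )
}

prereqs <- check_sparkr_prereqs()
str(prereqs)
```

If `java_home_set` is `FALSE`, set `JAVA_HOME` in the environment of the user that launches SparkR before continuing.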
SparkR Example
The following example launches SparkR and then uses R to create a DataFrame from R's built-in faithful dataset, list its first few rows, and read a people DataFrame from a JSON file. (For more information about Spark DataFrames, see "Using the Spark DataFrame API.")
Launch SparkR:
su spark
cd /usr/hdp/2.5.0.0-3485/spark/bin
./sparkR
Output similar to the following displays:
 Welcome to
    ____              __
   / __/__  ___ _____/ /__
  _\ \/ _ \/ _ `/ __/  '_/
 /___/ .__/\_,_/_/ /_/\_\   version  1.6.2
    /_/

 Spark context is available as sc, SQL context is available as sqlContext
>
From your R prompt (not the Spark shell), initialize SQLContext, create a DataFrame, and list the first few rows:
sqlContext <- sparkRSQL.init(sc)
df <- createDataFrame(sqlContext, faithful)
head(df)
Output similar to the following displays:
...
  eruptions waiting
1     3.600      79
2     1.800      54
3     3.333      74
4     2.283      62
5     4.533      85
6     2.883      55
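The selection, filtering, and aggregation operations mentioned at the start of this chapter can be tried on the same DataFrame. The following sketch assumes the session above, in which `df` was created from the `faithful` dataset; the threshold of 50 minutes and the variable names `short_waits` and `waiting_counts` are chosen only for illustration.

```r
# Assumes the SparkR shell session above, where `df` was created with
# createDataFrame(sqlContext, faithful).

# Selection: keep only the eruptions column.
head(select(df, df$eruptions))

# Filtering: keep rows with a waiting time under 50 minutes.
short_waits <- filter(df, df$waiting < 50)
head(short_waits)

# Aggregation: count how many rows share each waiting time, then sort
# so that the most common waiting times come first.
waiting_counts <- summarize(groupBy(df, df$waiting), count = n(df$waiting))
head(arrange(waiting_counts, desc(waiting_counts$count)))
```

Each call returns a new distributed DataFrame; `head` brings only the first few rows back into the local R session.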
Read the people DataFrame:
people <- read.df(sqlContext, "people.json", "json")
head(people)
Output similar to the following displays:
  age    name
1  NA Michael
2  30    Andy
3  19  Justin
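Once loaded, the people DataFrame can also be inspected and queried with SQL. The following sketch assumes the session above, in which `people` was read from people.json; the temporary table name and the age range in the query are assumptions made for illustration.

```r
# Assumes `people` was loaded with
# read.df(sqlContext, "people.json", "json") as shown above.

# Show the column names and types that Spark inferred from the JSON file.
printSchema(people)

# Register the DataFrame as a temporary table so SQL can refer to it.
# The table name "people" is an illustrative choice.
registerTempTable(people, "people")

# Select the names of people aged 13 through 19.
teenagers <- sql(sqlContext,
                 "SELECT name FROM people WHERE age >= 13 AND age <= 19")
head(teenagers)
```

The result of `sql` is itself a DataFrame, so it can be passed to the same functions used earlier, such as `head`, `filter`, and `select`.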
Additional Resources
For additional SparkR examples, see the Apache SparkR documentation.