Using Spark from R: SparkR
SparkR is an R package that provides a lightweight front end for using Apache Spark from R, supporting large-scale analytics on Hortonworks Data Platform (HDP) from the R language and environment.
SparkR provides a distributed data frame implementation that supports operations like selection, filtering, and aggregation on large datasets. In addition, SparkR supports distributed machine learning through MLlib.
This section lists prerequisites, walks through a SparkR example, and closes with brief sketches of common DataFrame operations and an MLlib model fit.
Before you run SparkR, ensure that your cluster meets the following prerequisites:
- R must be installed on all nodes. Commands for installing R are specific to the operating system; for example, on CentOS you would log on as root and run the following command:

  yum install R
- JAVA_HOME must be set on all nodes.
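A quick way to confirm that JAVA_HOME is visible from R (a minimal sanity check for a single node; an empty result means the variable is not set for that session):

  # Print the JAVA_HOME value seen by this R session
  Sys.getenv("JAVA_HOME")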
SparkR Example
The following example launches SparkR and then uses R to create a DataFrame in Spark, list the first few rows of that DataFrame, and read a separate people DataFrame from a JSON file. (For more information about Spark DataFrames, see "Using the Spark DataFrame API".)
- Launch SparkR:

  su spark
  cd /usr/hdp/2.6.0.0-598/spark/bin
  ./sparkR
Output similar to the following displays:
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.0
      /_/

Spark context is available as sc, SQL context is available as sqlContext
>
- From your R prompt (not the Spark shell), initialize SQLContext, create a DataFrame, and list the first few rows:

  sqlContext <- sparkRSQL.init(sc)
  df <- createDataFrame(sqlContext, faithful)
  head(df)
You should see results similar to the following:
...
  eruptions waiting
1     3.600      79
2     1.800      54
3     3.333      74
4     2.283      62
5     4.533      85
6     2.883      55
- Read the people DataFrame:

  people <- read.df(sqlContext, "people.json", "json")
  head(people)
You should see results similar to the following:
  age    name
1  NA Michael
2  30    Andy
3  19  Justin
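The df handle created in the walkthrough supports the selection, filtering, and aggregation operations described at the start of this section. The calls below are a minimal sketch using standard SparkR DataFrame functions, with the column names of the faithful dataset loaded earlier:

  # Selection: keep only the eruptions column
  head(select(df, df$eruptions))

  # Filtering: keep rows whose waiting time exceeds 70 minutes
  head(filter(df, df$waiting > 70))

  # Aggregation: count the observations for each distinct waiting time
  waiting_counts <- summarize(groupBy(df, df$waiting), count = n(df$waiting))
  head(waiting_counts)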
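SparkR's MLlib integration, mentioned at the beginning of this section, operates on the same DataFrames. As a closing sketch (assuming the spark.glm function available in the Spark 2.0 SparkR API and the df DataFrame from the walkthrough), a generalized linear model can be fit and inspected as follows:

  # Fit a Gaussian GLM that predicts waiting time from eruption duration
  model <- spark.glm(df, waiting ~ eruptions, family = "gaussian")

  # Print the fitted coefficients and summary statistics
  summary(model)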