Developing Apache Spark Applications

Using Spark from R: SparkR

SparkR is an R package that provides a lightweight front end for using Apache Spark from R, supporting large-scale analytics on Hortonworks Data Platform (HDP) from the R language and environment.

SparkR provides a distributed data frame implementation that supports operations like selection, filtering, and aggregation on large datasets. In addition, SparkR supports distributed machine learning through MLlib.

This section lists prerequisites, followed by a SparkR example.

Before you run SparkR, ensure that your cluster meets the following prerequisites:

  • R must be installed on all nodes. Commands for installing R are specific to the operating system. For example, on CentOS, log in as root and run the following command:

    yum install R
  • JAVA_HOME must be set on all nodes.

Note: SparkR is not currently supported on SLES.
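
The two prerequisites above can be checked on each node before launching SparkR. The following is a minimal sketch, not part of the product: the function name check_sparkr_prereqs is hypothetical, and the script only tests that an R executable is on the PATH and that JAVA_HOME is non-empty. It does not verify the R or Java versions.

```shell
#!/bin/sh
# Hypothetical helper: reports whether the SparkR prerequisites
# (R installed, JAVA_HOME set) appear to be met on this node.
check_sparkr_prereqs() {
    status=0

    # R must be installed: look for the R executable on the PATH.
    if command -v R >/dev/null 2>&1; then
        echo "R found: $(command -v R)"
    else
        echo "R not found on PATH"
        status=1
    fi

    # JAVA_HOME must be set: treat an empty value as unset.
    if [ -n "${JAVA_HOME:-}" ]; then
        echo "JAVA_HOME is set: $JAVA_HOME"
    else
        echo "JAVA_HOME is not set"
        status=1
    fi

    return "$status"
}

# Run the check and print a summary line if anything is missing.
check_sparkr_prereqs || echo "One or more SparkR prerequisites are missing on $(uname -n)"
```

Because the prerequisites apply to all nodes, a check like this would need to be run on every node in the cluster, not only on the node where you launch SparkR.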

SparkR Example

The following example launches SparkR, uses R to create a DataFrame in Spark from R's built-in faithful dataset, and lists the first few rows. The example then reads a people DataFrame from a JSON file. (For more information about Spark DataFrames, see "Using the Spark DataFrame API".)

  1. Launch SparkR:
    su spark
    cd /usr/hdp/2.6.0.0-598/spark/bin
    ./sparkR

    Output similar to the following displays:

    Welcome to
        ____              __ 
       / __/__  ___ _____/ /__ 
      _\ \/ _ \/ _ `/ __/  '_/ 
     /___/ .__/\_,_/_/ /_/\_\   version  2.0.0
        /_/ 
    
    Spark context is available as sc, SQL context is available as sqlContext
    >
  2. From your R prompt (not the Spark shell), initialize SQLContext, create a DataFrame, and list the first few rows:
    sqlContext <- sparkRSQL.init(sc)
    df <- createDataFrame(sqlContext, faithful)
    head(df)

    You should see results similar to the following:

    ...
     eruptions waiting
    1     3.600      79
    2     1.800      54
    3     3.333      74
    4     2.283      62
    5     4.533      85
    6     2.883      55
  3. Read the people DataFrame from a JSON file:
    people <- read.df(sqlContext, "people.json", "json")
    head(people)

    You should see results similar to the following:

     age    name
    1  NA Michael
    2  30    Andy
    3  19  Justin
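
Step 3 assumes that a people.json file already exists where Spark can read it. The following sketch creates a file whose contents are consistent with the output shown above. The exact records are an assumption inferred from that output: one JSON object per line, with no age for Michael. On a cluster where the default file system is HDFS, the file would probably also need to be uploaded before read.df can find it. The hdfs dfs -put line is left commented out for that reason, and the target location is an assumption.

```shell
#!/bin/sh
# Write a small JSON Lines file: one JSON object per line.
# The records are assumed from the sample output (Michael has no age,
# which SparkR displays as NA).
cat > people.json <<'EOF'
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}
EOF

# Show what was written.
cat people.json

# Assumption: if the cluster's default file system is HDFS, copy the file
# to the current user's HDFS home directory so that the relative path
# "people.json" resolves there.
# hdfs dfs -put people.json people.json
```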