Developing and running an Apache Spark WordCount application
This tutorial describes how to write, compile, and run a simple Spark word count application in two of the languages supported by Spark: Scala and Python. The Scala code was originally developed for a Cloudera tutorial written by Sandy Ryza.
This example application is an enhanced version of WordCount, the canonical MapReduce example. In this version of WordCount, the goal is to learn the distribution of letters in the most popular words in a corpus. The application:
- Creates a SparkConf and SparkContext. A Spark application corresponds to an instance of the SparkContext class. When running a shell, the SparkContext is created for you.
- Gets a word frequency threshold.
- Reads an input set of text documents.
- Counts the number of times each word appears.
- Filters out all words that appear fewer times than the threshold.
- For the remaining words, counts the number of times each letter occurs.
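The steps above can be sketched in plain Python, without Spark, to make the data flow concrete. This is an illustrative mock-up of the pipeline's logic, not the tutorial's actual Spark code; the function name `char_counts` and the whitespace tokenization are assumptions made for the example.

```python
from collections import Counter

def char_counts(lines, threshold):
    """Sketch of the WordCount pipeline: word counts -> threshold
    filter -> letter counts over the remaining distinct words."""
    # Count the number of times each word appears
    # (naive whitespace tokenization, assumed for this sketch).
    word_counts = Counter(w for line in lines for w in line.split())
    # Filter out words that appear fewer times than the threshold.
    frequent = [w for w, n in word_counts.items() if n >= threshold]
    # For the remaining words, count how often each letter occurs.
    return Counter(ch for w in frequent for ch in w)
```

In the Spark version, each of these three stages maps to an RDD transformation, and no intermediate result needs to be persisted to HDFS.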
Implemented in MapReduce, this would require two chained applications, as well as persisting the intermediate data to HDFS between them. The Spark version requires about 90 percent fewer lines of code than one developed using the MapReduce API.
Copy the pom.xml file to the parent $HOME/sparkwordcount directory and modify the sections listed below. For best practices when using Maven to build Spark applications, see Building Spark Applications.
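A typical modification is the dependency section of the pom.xml. The fragment below is a sketch only: the Spark artifact, Scala suffix, and version shown are placeholders and must be matched to the versions on your cluster.

```xml
<!-- Sketch of a Spark dependency entry; versions are assumptions,
     not values from this tutorial. Match them to your cluster. -->
<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.12</artifactId>
    <version>3.3.0</version>
    <!-- "provided" keeps Spark itself out of the application JAR,
         since the cluster supplies it at runtime. -->
    <scope>provided</scope>
  </dependency>
</dependencies>
```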