Spark Guide
Also available as:
PDF
loading table of contents...

WordCount Program

WordCount is a simple program that counts how often a word occurs in a text file. The code builds a dataset of (String, Int) pairs called counts, and saves the dataset to a file.

The following example submits WordCount code to the scala shell:

  1. Select an input file for the Spark WordCount example. You can use any text file as input.

  2. Log on as a user with HDFS access--for example, your spark user (if you defined one) or hdfs.

    The following example uses log4j.properties as the input file:

    cd /usr/hdp/current/spark-client/

    su spark

  3. Upload the input file to HDFS:

    hadoop fs -copyFromLocal /etc/hadoop/conf/log4j.properties /tmp/data

  4. Run the Spark shell:

    ./bin/spark-shell --master yarn-client --driver-memory 512m --executor-memory 512m

    You should see output similar to the following:

    16/08/02 19:33:26 INFO SecurityManager: Changing view acls to: spark
    
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /___/ .__/\_,_/_/ /_/\_\   version 1.6.2
          /_/
    
    Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60)
    Type in expressions to have them evaluated.
    Type :help for more information.
    16/08/02 19:33:30 INFO SparkContext: Running Spark version 1.6.2
    16/08/02 19:33:38 INFO EventLoggingListener: Logging events to hdfs:///spark-history/application_1459984611485_0009
    ...
    Spark context available as sc.
    ...
    16/08/02 19:33:42 INFO HiveContext: Initializing execution hive, version 1.2.1
    16/08/02 19:34:00 INFO HiveContext: Initializing HiveMetastoreConnection version 1.2.1 using Spark classes.
    16/08/02 19:34:00 INFO ClientWrapper: Inspected Hadoop version: 2.7.1.2.4.2.0-130
    16/08/02 19:34:00 INFO ClientWrapper: Loaded org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version 2.7.1.2.4.2.0-130
    16/08/02 19:34:01 INFO metastore: Trying to connect to metastore with URI thrift://green2:9083
    16/08/02 19:34:01 INFO metastore: Connected to metastore.
    SQL context available as sqlContext.
    
    scala> 
  5. At the scala> prompt, submit the job: type the following commands, replacing node names, file name and file location with your own values.

    val file = sc.textFile("/tmp/data")
    val counts = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    counts.saveAsTextFile("/tmp/wordcount")
  6. To view WordCount output in the scala shell:

    scala> counts.count()

    To view the full output from within the scala shell:

    counts.toArray().foreach(println)

    To view the output using HDFS:

    1. Exit the scala shell.

    2. View WordCount job results:

      hadoop fs -ls /tmp/wordcount

      You should see output similar to the following:

      /tmp/wordcount/_SUCCESS
      /tmp/wordcount/part-00000
      /tmp/wordcount/part-00001
    3. Use the HDFS cat command to list WordCount output. For example:

      hadoop fs -cat /tmp/wordcount/part-00000