WordCount Program

WordCount is a simple program that counts how often a word occurs in a text file. The code builds a dataset of (String, Int) pairs called counts, and saves the dataset to a file.

The following example submits WordCount code to the scala shell:

Select an input file for the Spark WordCount example. You can use any text file as input.
Log on as a user with HDFS access--for example, your spark user (if you defined one) or hdfs.
The following example uses log4j.properties as the input file:
cd /usr/hdp/current/spark-client/
su spark
Upload the input file to HDFS:
hadoop fs -copyFromLocal /etc/hadoop/conf/log4j.properties /tmp/data

Run the Spark shell:

./bin/spark-shell --master yarn-client --driver-memory 512m --executor-memory 512m

You should see output similar to the following:

bin/spark-shell 

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.5.2
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60)
Type in expressions to have them evaluated.
Type :help for more information.
16/05/31 16:28:09 INFO SparkContext: Running Spark version 1.5.2
16/05/31 16:28:09 INFO SecurityManager: Changing view acls to: root
...
...
16/05/31 16:28:14 INFO SparkILoop: Created sql context (with Hive support)..
SQL context available as sqlContext.

scala>

At the scala> prompt, submit the job: type the following commands, replacing node names, file name and file location with your own values.

val file = sc.textFile("/tmp/data")
val counts = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.saveAsTextFile("/tmp/wordcount")

To view WordCount output in the scala shell:
```
scala> counts.count()
```
To view the full output from within the scala shell:
```
counts.toArray().foreach(println)
```
To view the output using HDFS:
1. Exit the scala shell.
2. View WordCount job results:
```
hadoop fs -ls /tmp/wordcount
```
  You should see output similar to the following:
```
/tmp/wordcount/_SUCCESS
/tmp/wordcount/part-00000
/tmp/wordcount/part-00001
```
3. Use the HDFS cat command to list WordCount output. For example:
```
hadoop fs -cat /tmp/wordcount/part-00000
```

​WordCount Program

WordCount Program