WordCount Program
WordCount is a simple program that counts how often a word occurs in a text file. The
code builds a dataset of (String, Int) pairs called counts
, and saves the dataset
to a file.
The following example submits WordCount code to the scala shell:
Select an input file for the Spark WordCount example. You can use any text file as input.
Log on as a user with HDFS access--for example, your
spark
user (if you defined one) orhdfs
.The following example uses
log4j.properties
as the input file:cd /usr/hdp/current/spark-client/
su spark
Upload the input file to HDFS:
hadoop fs -copyFromLocal /etc/hadoop/conf/log4j.properties /tmp/data
Run the Spark shell:
./bin/spark-shell --master yarn-client --driver-memory 512m --executor-memory 512m
You should see output similar to the following:
16/08/02 19:33:26 INFO SecurityManager: Changing view acls to: spark Welcome to ____ __ / __/__ ___ _____/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 1.6.2 /_/ Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60) Type in expressions to have them evaluated. Type :help for more information. 16/08/02 19:33:30 INFO SparkContext: Running Spark version 1.6.2 16/08/02 19:33:38 INFO EventLoggingListener: Logging events to hdfs:///spark-history/application_1459984611485_0009 ... Spark context available as sc. ... 16/08/02 19:33:42 INFO HiveContext: Initializing execution hive, version 1.2.1 16/08/02 19:34:00 INFO HiveContext: Initializing HiveMetastoreConnection version 1.2.1 using Spark classes. 16/08/02 19:34:00 INFO ClientWrapper: Inspected Hadoop version: 2.7.1.2.4.2.0-130 16/08/02 19:34:00 INFO ClientWrapper: Loaded org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version 2.7.1.2.4.2.0-130 16/08/02 19:34:01 INFO metastore: Trying to connect to metastore with URI thrift://green2:9083 16/08/02 19:34:01 INFO metastore: Connected to metastore. SQL context available as sqlContext. scala>
At the
scala>
prompt, submit the job: type the following commands, replacing node names, file name and file location with your own values.val file = sc.textFile("/tmp/data") val counts = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _) counts.saveAsTextFile("/tmp/wordcount")
To view WordCount output in the scala shell:
scala> counts.count()
To view the full output from within the scala shell:
counts.toArray().foreach(println)
To view the output using HDFS:
Exit the scala shell.
View WordCount job results:
hadoop fs -ls /tmp/wordcount
You should see output similar to the following:
/tmp/wordcount/_SUCCESS /tmp/wordcount/part-00000 /tmp/wordcount/part-00001
Use the HDFS cat command to list WordCount output. For example:
hadoop fs -cat /tmp/wordcount/part-00000