Running WordCount v1.0
You can run WordCount v1.0 in your own Hadoop environment. If you do not have one available, you can download and install the Cloudera QuickStart VM.
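The steps below compile and run a class named org.myorg.WordCount from the WordCount.java source file. If you are following along without the full tutorial source in front of you, the following minimal sketch shows the shape of such a class: a mapper that emits (word, 1) pairs, a reducer that sums the counts for each word, and a driver that takes the input and output paths as arguments. It is written against the org.apache.hadoop.mapreduce API as an illustration only; the actual WordCount v1.0 listing may differ (for example, it may use the older org.apache.hadoop.mapred API).

// Illustrative sketch only; not necessarily the exact WordCount v1.0 source.
package org.myorg;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in each input line.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver: wires the mapper and reducer together and reads the
  // input and output paths from the command line.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Whichever version of the source you use, the compile, JAR, and run steps that follow are the same.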
- Before you run the sample, you must create input and output locations in HDFS. Use the following commands to create the input directory /user/cloudera/wordcount/input in HDFS:
$ sudo su hdfs
$ hadoop fs -mkdir /user/cloudera
$ hadoop fs -chown cloudera /user/cloudera
$ exit
$ sudo su cloudera
$ hadoop fs -mkdir /user/cloudera/wordcount /user/cloudera/wordcount/input
- Create sample text files to use as input, and move them to the /user/cloudera/wordcount/input directory in HDFS. You can use any files you choose; for convenience, the following shell commands create a few small input files for illustrative purposes. The Makefile also contains most of the commands that follow.
$ echo "Hadoop is an elephant" > file0 $ echo "Hadoop is as yellow as can be" > file1 $ echo "Oh what a yellow fellow is Hadoop" > file2 $ hadoop fs -put file* /user/cloudera/wordcount/input
- Compile the WordCount class.
To compile in a package installation of CDH:
$ mkdir -p build
$ javac -cp /usr/lib/hadoop/*:/usr/lib/hadoop-mapreduce/* WordCount.java -d build -Xlint
To compile in a parcel installation of CDH:
$ mkdir -p build
$ javac -cp /opt/cloudera/parcels/CDH/lib/hadoop/*:/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/* \
    WordCount.java -d build -Xlint
- Create a JAR file for the WordCount application.
$ jar -cvf wordcount.jar -C build/ .
- Run the WordCount application from the JAR file, passing the paths to the input and output directories in HDFS.
$ hadoop jar wordcount.jar org.myorg.WordCount /user/cloudera/wordcount/input /user/cloudera/wordcount/output
When you look at the output, all of the words are listed in UTF-8 alphabetical order (capitalized words first). The number of occurrences from all input files has been reduced to a single sum for each word.
$ hadoop fs -cat /user/cloudera/wordcount/output/*
Hadoop 3
Oh 1
a 1
an 1
as 2
be 1
can 1
elephant 1
fellow 1
is 3
what 1
yellow 2
- If you want to run the sample again, you first need to remove the output directory. Use the following command:
$ hadoop fs -rm -r /user/cloudera/wordcount/output