
Installing CDH 5 with YARN on a Single Linux Node in Pseudo-distributed mode

Before you start, uninstall MRv1 if necessary

If you have already installed MRv1 following the steps in the previous section, you now need to uninstall hadoop-0.20-conf-pseudo before running YARN. Proceed as follows.

  1. Stop the daemons:
    $ for x in `cd /etc/init.d ; ls hadoop-hdfs-*` ; do sudo service $x stop ; done
    $ for x in `cd /etc/init.d ; ls hadoop-0.20-mapreduce-*` ; do sudo service $x stop ; done
  2. Remove hadoop-0.20-conf-pseudo:
    • On Red Hat-compatible systems:
      $ sudo yum remove hadoop-0.20-conf-pseudo hadoop-0.20-mapreduce-*
    • On SLES systems:
      $ sudo zypper remove hadoop-0.20-conf-pseudo hadoop-0.20-mapreduce-*
    • On Ubuntu or Debian systems:
      $ sudo apt-get remove hadoop-0.20-conf-pseudo hadoop-0.20-mapreduce-*
      Note:

    In this case (after uninstalling hadoop-0.20-conf-pseudo) you can skip the package download steps below.
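
Before moving on, you can optionally confirm that the MRv1 packages are gone by querying the package manager; this check is not part of the official procedure, and no output means nothing is left to remove.

$ rpm -qa | grep hadoop-0.20      # Red Hat-compatible and SLES systems
$ dpkg -l 'hadoop-0.20*'          # Ubuntu and Debian systems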

  Important:

If you have not already done so, install the Oracle Java Development Kit (JDK) before deploying CDH 5. Follow these instructions.
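
As a quick sanity check (optional), verify that the JDK is on your path and that JAVA_HOME points at it; the path below is only an example and depends on where your JDK was installed.

$ java -version
$ export JAVA_HOME=/usr/java/latest   # example location for an Oracle JDK; adjust to your installation
$ $JAVA_HOME/bin/java -version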

On Red Hat/CentOS/Oracle 5 or 6 systems, do the following:

Download the CDH 5 Package

  1. Click the entry in the table below that matches your Red Hat or CentOS system, choose Save File, and save the file to a directory to which you have write access (it can be your home directory).
    OS Version                  Click this Link
    Red Hat/CentOS/Oracle 5     Red Hat/CentOS/Oracle 5 link
    Red Hat/CentOS/Oracle 6     Red Hat/CentOS/Oracle 6 link
  2. Install the RPM.

    For Red Hat/CentOS/Oracle 5:

    $ sudo yum --nogpgcheck localinstall cloudera-cdh-5-0.x86_64.rpm 

    For Red Hat/CentOS/Oracle 6 (64-bit):

    $ sudo yum --nogpgcheck localinstall cloudera-cdh-5-0.x86_64.rpm
      Note:

    For instructions on how to add a CDH 5 yum repository or build your own CDH 5 yum repository, see Installing CDH 5 On Red Hat-compatible systems.
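
If you would rather install from a yum repository than from the "1-click Install" RPM, the repository is defined by a small file under /etc/yum.repos.d/. The following sketch is for a Red Hat/CentOS/Oracle 6 system and is illustrative only; confirm the exact baseurl against the Installing CDH 5 On Red Hat-compatible systems page.

# /etc/yum.repos.d/cloudera-cdh5.repo (illustrative example)
[cloudera-cdh5]
name=Cloudera's Distribution for Hadoop, Version 5
baseurl=http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/5/
gpgkey=http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera
gpgcheck=1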

Install CDH 5

  1. (Optional) Add the Cloudera Public GPG Key to your repository by executing the following command:
    • For Red Hat/CentOS/Oracle 5 systems:
      $ sudo rpm --import http://archive.cloudera.com/cdh5/redhat/5/x86_64/cdh/RPM-GPG-KEY-cloudera
    • For Red Hat/CentOS/Oracle 6 systems:
      $ sudo rpm --import http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera
  2. Install Hadoop with YARN in pseudo-distributed mode:
    $ sudo yum install hadoop-conf-pseudo

On SLES systems, do the following:

Download and install the CDH 5 package

  1. Download the CDH 5 "1-click Install" package.

    Click this link, choose Save File, and save it to a directory to which you have write access (it can be your home directory).

  2. Install the RPM:
    $ sudo rpm -i cloudera-cdh-5-0.x86_64.rpm
      Note:

    For instructions on how to add a CDH 5 SLES repository or build your own CDH 5 SLES repository, see Installing CDH 5 On SLES systems.

Install CDH 5

  1. (Optional) Add the Cloudera Public GPG Key to your repository by executing the following command:
    • For all SLES systems:
      $ sudo rpm --import http://archive.cloudera.com/cdh5/sles/11/x86_64/cdh/RPM-GPG-KEY-cloudera
  2. Install Hadoop with YARN in pseudo-distributed mode:
    $ sudo zypper install hadoop-conf-pseudo 

On Ubuntu and other Debian systems, do the following:

Download and install the package

  1. Download the CDH 5 "1-click Install" package:
    OS Version    Click this Link
    Wheezy        Wheezy link
    Precise       Precise link
  2. Install the package. Do one of the following:
    • Choose Open with in the download window to use the package manager.
    • Choose Save File, save the package to a directory to which you have write access (it can be your home directory) and install it from the command line, for example:
      sudo dpkg -i cdh5-repository_1.0_all.deb
  Note:

For instructions on how to add a CDH 5 Debian repository or build your own CDH 5 Debian repository, see Installing CDH 5 On Ubuntu or Debian systems.

Install CDH 5

  1. (Optional) Add the Cloudera Public GPG Key to your repository by executing the following command:
    • For Ubuntu Lucid systems:
      $ curl -s http://archive.cloudera.com/cdh5/ubuntu/lucid/amd64/cdh/archive.key | sudo apt-key add -
    • For Ubuntu Precise systems:
      $ curl -s http://archive.cloudera.com/cdh5/ubuntu/precise/amd64/cdh/archive.key | sudo apt-key add -
    • For Debian Squeeze systems:
      $ curl -s http://archive.cloudera.com/cdh5/debian/squeeze/amd64/cdh/archive.key | sudo apt-key add -
  2. Install Hadoop with YARN in pseudo-distributed mode:
    $ sudo apt-get update
    $ sudo apt-get install hadoop-conf-pseudo

Starting Hadoop and Verifying it is Working Properly

For YARN, a pseudo-distributed Hadoop installation consists of one node running all five Hadoop daemons: namenode, secondarynamenode, resourcemanager, datanode, and nodemanager.

  • To view the files on Red Hat or SLES systems:
$ rpm -ql hadoop-conf-pseudo
  • To view the files on Ubuntu systems:
$ dpkg -L hadoop-conf-pseudo

The new configuration is self-contained in the /etc/hadoop/conf.pseudo directory.

  Note:

The Cloudera packages use the alternatives framework for managing which Hadoop configuration is active. All Hadoop components search for the Hadoop configuration in /etc/hadoop/conf.
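
To see which configuration the alternatives system currently points at (an optional check), display the hadoop-conf alternative; in this pseudo-distributed installation it should point to /etc/hadoop/conf.pseudo.

$ alternatives --display hadoop-conf          # Red Hat-compatible systems
$ update-alternatives --display hadoop-conf   # Ubuntu, Debian, and SLES systems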

To start Hadoop, proceed as follows.

Step 1: Format the NameNode.

Before starting the NameNode for the first time you must format the file system.

$ sudo -u hdfs hdfs namenode -format
  Note:

Make sure you format the NameNode as user hdfs. You can do this as part of the command string, using sudo -u hdfs as in the command above.

  Important:

In earlier releases, the hadoop-conf-pseudo package automatically formatted HDFS on installation. In CDH 5, you must do this explicitly.

Step 2: Start HDFS

$ for x in `cd /etc/init.d ; ls hadoop-hdfs-*` ; do sudo service $x start ; done

To verify that the services have started, you can check the web console. The NameNode provides a web console at http://localhost:50070/ for viewing your Distributed File System (DFS) capacity, the number of DataNodes, and logs. In this pseudo-distributed configuration, you should see one live DataNode named localhost.
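
If you prefer a command-line check (optional), an HDFS status report should also show one live DataNode:

$ sudo -u hdfs hdfs dfsadmin -report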

Step 3: Create the /tmp, Staging and Log Directories

  1. Remove the old /tmp if it exists:
    $ sudo -u hdfs hadoop fs -rm -r /tmp
  2. Create the new directories and set permissions:
    $ sudo -u hdfs hadoop fs -mkdir -p /tmp/hadoop-yarn/staging/history/done_intermediate
    $ sudo -u hdfs hadoop fs -chown -R mapred:mapred /tmp/hadoop-yarn/staging
    $ sudo -u hdfs hadoop fs -chmod -R 1777 /tmp
    $ sudo -u hdfs hadoop fs -mkdir -p /var/log/hadoop-yarn
    $ sudo -u hdfs hadoop fs -chown yarn:mapred /var/log/hadoop-yarn
      Note: You need to create /var/log/hadoop-yarn because it is the parent of /var/log/hadoop-yarn/apps, which is explicitly configured in yarn-site.xml.

Step 4: Verify the HDFS File Structure

Run the following command:

$ sudo -u hdfs hadoop fs -ls -R /

You should see the following directory structure:

drwxrwxrwt - hdfs supergroup 0 2012-05-31 15:31 /tmp
drwxr-xr-x - hdfs supergroup 0 2012-05-31 15:31 /tmp/hadoop-yarn
drwxrwxrwt - mapred mapred 0 2012-05-31 15:31 /tmp/hadoop-yarn/staging
drwxr-xr-x - mapred mapred 0 2012-05-31 15:31 /tmp/hadoop-yarn/staging/history
drwxrwxrwt - mapred mapred 0 2012-05-31 15:31 /tmp/hadoop-yarn/staging/history/done_intermediate
drwxr-xr-x - hdfs supergroup 0 2012-05-31 15:31 /var
drwxr-xr-x - hdfs supergroup 0 2012-05-31 15:31 /var/log
drwxr-xr-x - yarn mapred 0 2012-05-31 15:31 /var/log/hadoop-yarn

Step 5: Start YARN

$ sudo service hadoop-yarn-resourcemanager start
$ sudo service hadoop-yarn-nodemanager start
$ sudo service hadoop-mapreduce-historyserver start
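
As an optional check that YARN is up, the ResourceManager provides a web console at http://localhost:8088/ and the JobHistory Server at http://localhost:19888/ (the default ports); from the command line you can also list the registered NodeManagers:

$ yarn node -list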

Step 6: Create User Directories

Create a home directory for each MapReduce user. It is best to do this on the NameNode; for example:

$ sudo -u hdfs hadoop fs -mkdir /user/<user>
$ sudo -u hdfs hadoop fs -chown <user> /user/<user>

where <user> is the Linux username of each user.

Alternatively, you can log in as each Linux user (or write a script to do so) and create the home directory as follows:

$ sudo -u hdfs hadoop fs -mkdir /user/$USER
$ sudo -u hdfs hadoop fs -chown $USER /user/$USER
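
If you need to create home directories for several users, a small shell loop saves repetition; this is only an illustrative sketch, and the user names alice and bob are placeholders for your own Linux usernames.

$ for u in alice bob ; do sudo -u hdfs hadoop fs -mkdir -p /user/$u ; sudo -u hdfs hadoop fs -chown $u /user/$u ; done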

Running an example application with YARN

  1. Create a home directory on HDFS for the user who will be running the job (for example, joe):
    $ sudo -u hdfs hadoop fs -mkdir /user/joe
    $ sudo -u hdfs hadoop fs -chown joe /user/joe

    Do the following steps as the user joe.

  2. Make a directory in HDFS called input and copy some XML files into it by running the following commands in pseudo-distributed mode:
    $ hadoop fs -mkdir input
    $ hadoop fs -put /etc/hadoop/conf/*.xml input
    $ hadoop fs -ls input
    Found 3 items
    -rw-r--r-- 1 joe supergroup 1348 2012-02-13 12:21 input/core-site.xml
    -rw-r--r-- 1 joe supergroup 1913 2012-02-13 12:21 input/hdfs-site.xml
    -rw-r--r-- 1 joe supergroup 1001 2012-02-13 12:21 input/mapred-site.xml
  3. Set HADOOP_MAPRED_HOME for user joe:
    $ export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce
  4. Run an example Hadoop job to grep with a regular expression in your input data.
    $ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar grep input output23 'dfs[a-z.]+'
  5. After the job completes, you can find the output in the HDFS directory named output23 because you specified that output directory to Hadoop.
    $ hadoop fs -ls
    Found 2 items
    drwxr-xr-x - joe supergroup 0 2009-08-18 18:36 /user/joe/input
    drwxr-xr-x - joe supergroup 0 2009-08-18 18:38 /user/joe/output23

    You can see that there is a new directory called output23.

  6. List the output files.
    $ hadoop fs -ls output23
    Found 2 items
    -rw-r--r-- 1 joe supergroup 0 2009-02-25 10:33 /user/joe/output23/_SUCCESS
    -rw-r--r-- 1 joe supergroup 1068 2009-02-25 10:33 /user/joe/output23/part-r-00000
  7. Read the results in the output file.
    $ hadoop fs -cat output23/part-r-00000 | head
    1 dfs.safemode.min.datanodes
    1 dfs.safemode.extension
    1 dfs.replication
    1 dfs.permissions.enabled
    1 dfs.namenode.name.dir
    1 dfs.namenode.checkpoint.dir
    1 dfs.datanode.data.dir
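
If you want to run the example again, remember that Hadoop refuses to write to an output directory that already exists; either pass a new directory name (such as output24) or remove the old one first:

$ hadoop fs -rm -r output23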