Installing CDH 5 with YARN on a Single Linux Node in Pseudo-distributed mode

Before you start, uninstall MRv1 if necessary

If you have already installed MRv1 following the steps in the previous section, you now need to uninstall hadoop-0.20-conf-pseudo before running YARN. Proceed as follows.

  1. Stop the daemons:
    $ for x in `cd /etc/init.d ; ls hadoop-hdfs-*` ; do sudo service $x stop ; done 
    $ for x in 'cd /etc/init.d ; ls hadoop-0.20-mapreduce-*' ; do sudo service $x stop ; done
  2. Remove hadoop-0.20-conf-pseudo:
    • On Red Hat-compatible systems:
      $ sudo yum remove hadoop-0.20-conf-pseudo hadoop-0.20-mapreduce-*
    • On SLES systems:
      $ sudo zypper remove hadoop-0.20-conf-pseudo hadoop-0.20-mapreduce-*
    • On Ubuntu or Debian systems:
      $ sudo apt-get remove hadoop-0.20-conf-pseudo hadoop-0.20-mapreduce-*

    In this case (after uninstalling hadoop-0.20-conf-pseudo) you can skip the package download steps below.

On Red Hat/CentOS/Oracle 5 or Red Hat 6 systems, do the following:

Download the CDH 5 Package

  1. Click the entry in the table below that matches your Red Hat or CentOS system, choose Save File, and save the file to a directory to which you have write access (it can be your home directory).
    OS Version Click this Link
    Red Hat/CentOS/Oracle 5 Red Hat/CentOS/Oracle 5 link
    Red Hat/CentOS/Oracle 6 Red Hat/CentOS/Oracle 6 link
  2. Install the RPM.

    For Red Hat/CentOS/Oracle 5:

    $ sudo yum --nogpgcheck localinstall cloudera-cdh-5-0.x86_64.rpm 

    For Red Hat/CentOS/Oracle 6 (64-bit):

    $ sudo yum --nogpgcheck localinstall cloudera-cdh-5-0.x86_64.rpm

    For instructions on how to add a CDH 5 yum repository or build your own CDH 5 yum repository, see Installing CDH 5 On Red Hat-compatible systems.

Install CDH 5

  1. (Optionally) add a repository key. Add the Cloudera Public GPG Key to your repository by executing the following command:
    • For Red Hat/CentOS/Oracle 5 systems:
      $ sudo rpm --import https://archive.cloudera.com/cdh5/redhat/5/x86_64/cdh/RPM-GPG-KEY-cloudera
    • For Red Hat/CentOS/Oracle 6 systems:
      $ sudo rpm --import https://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera
  2. Install Hadoop in pseudo-distributed mode: To install Hadoop with YARN:
    $ sudo yum install hadoop-conf-pseudo

On SLES systems, do the following:

Download and install the CDH 5 package

  1. Download the CDH 5 "1-click Install" package.

    Click this link, choose Save File, and save it to a directory to which you have write access (for example, your home directory).

  2. Install the RPM:
    $ sudo rpm -i cloudera-cdh-5-0.x86_64.rpm

    For instructions on how to add a CDH 5 SLES repository or build your own CDH 5 SLES repository, see Installing CDH 5 On SLES systems.

Install CDH 5

  1. (Optionally) add a repository key. Add the Cloudera Public GPG Key to your repository by executing the following command:
    • For all SLES systems:
      $ sudo rpm --import https://archive.cloudera.com/cdh5/sles/11/x86_64/cdh/RPM-GPG-KEY-cloudera
  2. Install Hadoop in pseudo-distributed mode: To install Hadoop with YARN:
    $ sudo zypper install hadoop-conf-pseudo 

On Ubuntu and other Debian systems, do the following:

Download and install the package

  1. Download the CDH 5 "1-click Install" package:
    OS Version Click this Link
    Wheezy Wheezy link
    Precise Precise link
    Trusty Trusty link
  2. Install the package by doing one of the following:
    • Choose Open with in the download window to use the package manager.
    • Choose Save File, save the package to a directory to which you have write access (for example, your home directory), and install it from the command line. For example:
      sudo dpkg -i cdh5-repository_1.0_all.deb

Install CDH 5

  1. (Optionally) add a repository key. Add the Cloudera Public GPG Key to your repository by executing the following command:
    • For Ubuntu Lucid systems:
      $ curl -s https://archive.cloudera.com/cdh5/ubuntu/lucid/amd64/cdh/archive.key | sudo apt-key add -
    • For Ubuntu Precise systems:
      $ curl -s https://archive.cloudera.com/cdh5/ubuntu/precise/amd64/cdh/archive.key | sudo apt-key add -
    • For Debian Squeeze systems:
      $ curl -s https://archive.cloudera.com/cdh5/debian/squeeze/amd64/cdh/archive.key | sudo apt-key add -
  2. Install Hadoop in pseudo-distributed mode: To install Hadoop with YARN:
    $ sudo apt-get update 
    $ sudo apt-get install hadoop-conf-pseudo

Starting Hadoop and Verifying it is Working Properly

For YARN, a pseudo-distributed Hadoop installation consists of one node running all five Hadoop daemons: namenode, secondarynamenode, resourcemanager, datanode, and nodemanager.

  • To view the files on Red Hat or SLES systems:
$ rpm -ql hadoop-conf-pseudo
  • To view the files on Ubuntu systems:
$ dpkg -L hadoop-conf-pseudo

The new configuration is self-contained in the /etc/hadoop/conf.pseudo directory.

The Cloudera packages use the alternative framework for managing which Hadoop configuration is active. All Hadoop components search for the Hadoop configuration in /etc/hadoop/conf.

To start Hadoop, proceed as follows.

Step 1: Format the NameNode.

Before starting the NameNode for the first time you must format the file system.

$ sudo -u hdfs hdfs namenode -format

Make sure you perform the format of the NameNode as user hdfs. You can do this as part of the command string, using sudo -u hdfs as in the command above.

Step 2: Start HDFS

$ for x in `cd /etc/init.d ; ls hadoop-hdfs-*` ; do sudo service $x start ; done

To verify services have started, you can check the web console. The NameNode provides a web console http://localhost:50070/ for viewing your Distributed File System (DFS) capacity, number of DataNodes, and logs. In this pseudo-distributed configuration, you should see one live DataNode named localhost.

Step 3: Create the directories needed for Hadoop processes.

Issue the following command to create the directories needed for all installed Hadoop processes with the appropriate permissions.
$ sudo /usr/lib/hadoop/libexec/init-hdfs.sh

Step 4: Verify the HDFS File Structure:

Run the following command:

$ sudo -u hdfs hadoop fs -ls -R /

You should see output similar to the following excerpt:

...
drwxrwxrwt - hdfs supergroup 0 2012-05-31 15:31 /tmp
drwxr-xr-x - hdfs supergroup 0 2012-05-31 15:31 /tmp/hadoop-yarn
drwxrwxrwt - mapred mapred 0 2012-05-31 15:31 /tmp/hadoop-yarn/staging
drwxr-xr-x - mapred mapred 0 2012-05-31 15:31 /tmp/hadoop-yarn/staging/history
drwxrwxrwt - mapred mapred 0 2012-05-31 15:31 /tmp/hadoop-yarn/staging/history/done_intermediate
drwxr-xr-x - hdfs supergroup 0 2012-05-31 15:31 /var
drwxr-xr-x - hdfs supergroup 0 2012-05-31 15:31 /var/log
drwxr-xr-x - yarn mapred 0 2012-05-31 15:31 /var/log/hadoop-yarn
...

Step 5: Start YARN

$ sudo service hadoop-yarn-resourcemanager start
$ sudo service hadoop-yarn-nodemanager start 
$ sudo service hadoop-mapreduce-historyserver start

Step 6: Create User Directories

Create a home directory for each MapReduce user. It is best to do this on the NameNode; for example:

$ sudo -u hdfs hadoop fs -mkdir /user/<user>
$ sudo -u hdfs hadoop fs -chown <user> /user/<user>

where <user> is the Linux username of each user.

Alternatively, you can log in as each Linux user (or write a script to do so) and create the home directory as follows:

$ sudo -u hdfs hadoop fs -mkdir /user/$USER
$ sudo -u hdfs hadoop fs -chown $USER /user/$USER

Running an example application with YARN

  1. Create a home directory on HDFS for the user who will be running the job (for example, joe):
    $ sudo -u hdfs hadoop fs -mkdir /user/joe 
    $ sudo -u hdfs hadoop fs -chown joe /user/joe

    Do the following steps as the user joe.

  2. Make a directory in HDFS called input and copy some XML files into it by running the following commands in pseudo-distributed mode:
    $ hadoop fs -mkdir input
    $ hadoop fs -put /etc/hadoop/conf/*.xml input
    $ hadoop fs -ls input
    Found 3 items:
    -rw-r--r-- 1 joe supergroup 1348 2012-02-13 12:21 input/core-site.xml
    -rw-r--r-- 1 joe supergroup 1913 2012-02-13 12:21 input/hdfs-site.xml
    -rw-r--r-- 1 joe supergroup 1001 2012-02-13 12:21 input/mapred-site.xml
  3. Set HADOOP_MAPRED_HOME for user joe:
    $ export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce
  4. Run an example Hadoop job to grep with a regular expression in your input data.
    $ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar grep input output23 'dfs[a-z.]+'
  5. After the job completes, you can find the output in the HDFS directory named output23 because you specified that output directory to Hadoop.
    $ hadoop fs -ls 
    Found 2 items
    drwxr-xr-x - joe supergroup 0 2009-08-18 18:36 /user/joe/input
    drwxr-xr-x - joe supergroup 0 2009-08-18 18:38 /user/joe/output23

    You can see that there is a new directory called output23.

  6. List the output files.
    $ hadoop fs -ls output23 
    Found 2 items
    drwxr-xr-x - joe supergroup 0 2009-02-25 10:33 /user/joe/output23/_SUCCESS
    -rw-r--r-- 1 joe supergroup 1068 2009-02-25 10:33 /user/joe/output23/part-r-00000
  7. Read the results in the output file.
    $ hadoop fs -cat output23/part-r-00000 | head
    1 dfs.safemode.min.datanodes
    1 dfs.safemode.extension
    1 dfs.replication
    1 dfs.permissions.enabled
    1 dfs.namenode.name.dir
    1 dfs.namenode.checkpoint.dir
    1 dfs.datanode.data.dir