This is the documentation for CDH 5.0.x. Documentation for other versions is available at Cloudera Documentation.

Installing CDH 5 with MRv1 on a Single Linux Node in Pseudo-distributed mode

  Important:
  • Running services: when starting, stopping and restarting CDH components, always use the service (8) command rather than running /etc/init.d scripts directly. This is important because service sets the current working directory to / and removes most environment variables (passing only LANG and TERM) so as to create a predictable environment in which to administer the service. If you run the /etc/init.d scripts directly, any environment variables you have set remain in force, and could produce unpredictable results. (If you install CDH from packages, service will be installed as part of the Linux Standard Base (LSB).)
  • Java Development Kit: if you have not already done so, install the Oracle Java Development Kit (JDK) before deploying CDH. Follow these instructions.
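
To make the point about service concrete, the sketch below approximates the kind of scrubbed environment described above. It is an illustration only, not the actual implementation of service (8): the run_clean helper name is hypothetical, and the PATH value is an assumption added so that the child process can still find commands.

```shell
# run_clean: hypothetical helper that roughly mimics what service does before
# invoking an init script: change to / and drop the caller's environment.
run_clean() {
    (
        cd / || exit 1
        # Pass only LANG and TERM through; PATH is an assumption of this sketch.
        env -i LANG="${LANG:-C}" TERM="${TERM:-dumb}" \
            PATH="/sbin:/usr/sbin:/bin:/usr/bin" "$@"
    )
}

# A variable exported by the caller does not reach the child process.
export MY_SETTING=leaky
run_clean /bin/sh -c 'echo "cwd=$(pwd) MY_SETTING=${MY_SETTING:-unset}"'
```

Running an /etc/init.d script directly skips this scrubbing, which is why a variable such as MY_SETTING in the sketch would leak into the daemon's environment.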

On Red Hat/CentOS/Oracle 5 or Red Hat 6 systems, do the following:

Download the CDH 5 Package

  1. Click the entry in the table below that matches your Red Hat or CentOS system, choose Save File, and save the file to a directory to which you have write access (it can be your home directory).
    OS Version                 Click this Link
    Red Hat/CentOS/Oracle 5    Red Hat/CentOS/Oracle 5 link
    Red Hat/CentOS/Oracle 6    Red Hat/CentOS/Oracle 6 link
  2. Install the RPM.

    For Red Hat/CentOS/Oracle 5:

    $ sudo yum --nogpgcheck localinstall cloudera-cdh-5-0.x86_64.rpm 

    For Red Hat/CentOS/Oracle 6 (64-bit):

    $ sudo yum --nogpgcheck localinstall cloudera-cdh-5-0.x86_64.rpm
      Note:

    For instructions on how to add a CDH 5 yum repository or build your own CDH 5 yum repository, see Installing CDH 5 On Red Hat-compatible systems.

Install CDH 5

  1. (Optionally) add a repository key. Add the Cloudera Public GPG Key to your repository by executing one of the following commands:
    • For Red Hat/CentOS/Oracle 5 systems:
      $ sudo rpm --import http://archive.cloudera.com/cdh5/redhat/5/x86_64/cdh/RPM-GPG-KEY-cloudera 
    • For Red Hat/CentOS/Oracle 6 systems:
      $ sudo rpm --import http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera 
  2. Install Hadoop in pseudo-distributed mode:
    To install Hadoop with MRv1:
    $ sudo yum install hadoop-0.20-conf-pseudo

On SLES systems, do the following:

Download and install the CDH 5 package

  1. Download the CDH 5 "1-click Install" package.

    Click this link, choose Save File, and save it to a directory to which you have write access (it can be your home directory).

  2. Install the RPM:
    $ sudo rpm -i cloudera-cdh-5-0.x86_64.rpm
      Note:

    For instructions on how to add a CDH 5 SLES repository or build your own CDH 5 SLES repository, see Installing CDH 5 On SLES systems.

Install CDH 5

  1. (Optionally) add a repository key. Add the Cloudera Public GPG Key to your repository by executing the following command:
    • For all SLES systems:
      $ sudo rpm --import http://archive.cloudera.com/cdh5/sles/11/x86_64/cdh/RPM-GPG-KEY-cloudera
  2. Install Hadoop in pseudo-distributed mode:

    To install Hadoop with MRv1:

    $ sudo zypper install hadoop-0.20-conf-pseudo 

On Ubuntu and other Debian systems, do the following:

Download and install the package

  1. Download the CDH 5 "1-click Install" package:
    OS Version    Click this Link
    Wheezy        Wheezy link
    Precise       Precise link
  2. Install the package. Do one of the following:
    • Choose Open with in the download window to use the package manager.
    • Choose Save File, save the package to a directory to which you have write access (it can be your home directory) and install it from the command line, for example:
      $ sudo dpkg -i cdh5-repository_1.0_all.deb
  Note:

For instructions on how to add a CDH 5 Debian repository or build your own CDH 5 Debian repository, see Installing CDH 5 on Ubuntu or Debian systems.

Install CDH 5

  1. (Optionally) add a repository key. Add the Cloudera Public GPG Key to your repository by executing one of the following commands:
    • For Ubuntu Lucid systems:
      $ curl -s http://archive.cloudera.com/cdh5/ubuntu/lucid/amd64/cdh/archive.key | sudo apt-key add -
    • For Ubuntu Precise systems:
      $ curl -s http://archive.cloudera.com/cdh5/ubuntu/precise/amd64/cdh/archive.key | sudo apt-key add -
    • For Debian Squeeze systems:
      $ curl -s http://archive.cloudera.com/cdh5/debian/squeeze/amd64/cdh/archive.key | sudo apt-key add -
  2. Install Hadoop in pseudo-distributed mode:

    To install Hadoop with MRv1:

    $ sudo apt-get update
    $ sudo apt-get install hadoop-0.20-conf-pseudo
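
The three platform-specific install commands above differ only in the package manager. As a rough sketch, they can be collected in one helper; the install_cmd function name is hypothetical, and the helper only prints the command rather than running it, so that it can be reviewed first.

```shell
# install_cmd: print the documented install command for a given package manager.
install_cmd() {
    pkg="hadoop-0.20-conf-pseudo"
    case "$1" in
        yum)     echo "sudo yum install $pkg" ;;
        zypper)  echo "sudo zypper install $pkg" ;;
        apt-get) echo "sudo apt-get update && sudo apt-get install $pkg" ;;
        *)       echo "unsupported package manager: $1" >&2; return 1 ;;
    esac
}

# Report the command for the first package manager found on this machine.
for pm in yum zypper apt-get; do
    if command -v "$pm" >/dev/null 2>&1; then
        install_cmd "$pm"
        break
    fi
done
```

The sketch assumes that the CDH 5 repository package for the platform has already been installed as described in the corresponding section.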

Starting Hadoop and Verifying it is Working Properly:

For MRv1, a pseudo-distributed Hadoop installation consists of one node running all five Hadoop daemons: namenode, jobtracker, secondarynamenode, datanode, and tasktracker.

To verify the installation, view the files that the hadoop-0.20-conf-pseudo package installed on your system:

  • To view the files on Red Hat or SLES systems:
$ rpm -ql hadoop-0.20-conf-pseudo 
  • To view the files on Ubuntu systems:
$ dpkg -L hadoop-0.20-conf-pseudo

The new configuration is self-contained in the /etc/hadoop/conf.pseudo.mr1 directory.

  Note:

The Cloudera packages use the alternatives framework for managing which Hadoop configuration is active. All Hadoop components search for the Hadoop configuration in /etc/hadoop/conf.
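
Because /etc/hadoop/conf is managed through alternatives, it is typically a chain of symbolic links, and resolving it shows which configuration directory is currently active. The following is a minimal sketch, assuming a readlink that supports -f (GNU coreutils does); the active_conf_dir helper name is hypothetical.

```shell
# active_conf_dir: resolve the symbolic-link chain behind a configuration path.
# Defaults to /etc/hadoop/conf, the path all Hadoop components search.
active_conf_dir() {
    readlink -f "${1:-/etc/hadoop/conf}"
}

# On a machine without CDH installed the path may not exist, so tolerate failure.
active_conf_dir || echo "/etc/hadoop/conf could not be resolved" >&2
```

On a pseudo-distributed MRv1 installation the result would be expected to be /etc/hadoop/conf.pseudo.mr1. The alternatives tool can also report this directly, for example with sudo alternatives --display hadoop-conf on Red Hat-compatible systems or sudo update-alternatives --display hadoop-conf on Debian, Ubuntu, and SLES systems.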

To start Hadoop, proceed as follows.

Step 1: Format the NameNode.

Before starting the NameNode for the first time you must format the file system.

$ sudo -u hdfs hdfs namenode -format
  Note:

Make sure you perform the format of the NameNode as user hdfs. You can do this as part of the command string, using sudo -u hdfs as in the command above.

  Note:

If Kerberos is enabled, do not use commands in the form sudo -u <user> <command>; they will fail with a security error. Instead, use the following commands: $ kinit <user> (if you are using a password) or $ kinit -kt <keytab> <principal> (if you are using a keytab) and then, for each command executed by this user, $ <command>
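
The two ways of running a command as a given user can be wrapped in one helper so that the rest of a script does not need to care which applies. This is only a sketch: the as_user function and the HADOOP_KERBEROS variable are hypothetical names invented for this example, not CDH settings, and the Kerberos branch assumes that kinit has already been run for the right principal.

```shell
# as_user: run a command as <user>, either through sudo or, when Kerberos is
# in use, directly under the credentials obtained earlier with kinit.
as_user() {
    user=$1
    shift
    if [ "${HADOOP_KERBEROS:-0}" = 1 ]; then
        "$@"                    # assumes a prior kinit for this user
    else
        sudo -u "$user" "$@"
    fi
}
```

With this in place, the format command would be written as follows (for the Kerberos case, set HADOOP_KERBEROS=1 after running kinit):

    $ as_user hdfs hdfs namenode -format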

  Important:

In earlier releases, the hadoop-conf-pseudo package automatically formatted HDFS on installation. In CDH 5, you must do this explicitly.

Step 2: Start HDFS

$ for x in `cd /etc/init.d ; ls hadoop-hdfs-*` ; do sudo service $x start ; done

To verify that the services have started, check the web console. The NameNode provides a web console at http://localhost:50070/ for viewing your Distributed File System (DFS) capacity, number of DataNodes, and logs. In this pseudo-distributed configuration, you should see one live DataNode named localhost.
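
The web console check can also be done from the command line. The sketch below assumes that curl is installed; the console_up helper name is hypothetical. It only tests whether the console answers an HTTP request and does not inspect the page contents.

```shell
# console_up: succeed if the given URL answers with a non-error HTTP status
# within five seconds. Nothing is printed; only the exit status matters.
console_up() {
    curl -s -f -o /dev/null --max-time 5 "$1"
}

if console_up "http://localhost:50070/"; then
    echo "NameNode web console is reachable"
else
    echo "NameNode web console is not reachable (yet)"
fi
```

The daemons can take a few seconds to come up, so a negative result immediately after starting the services is not necessarily an error.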

Step 3: Create the /tmp Directory

Create the /tmp directory and set permissions:

  Important:

If you do not create /tmp properly, with the right permissions as shown below, you may have problems with CDH components later. Specifically, if you don't create /tmp yourself, another process may create it automatically with restrictive permissions that will prevent your other applications from using it.

Create the /tmp directory after HDFS is up and running, and set its permissions to 1777 (drwxrwxrwt), as follows:

$ sudo -u hdfs hadoop fs -mkdir -p /tmp
$ sudo -u hdfs hadoop fs -chmod -R 1777 /tmp
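
Mode 1777 is world-writable (777) plus the sticky bit (the leading 1), which is what turns the final x of the listing into t. The effect can be seen on the local file system without touching HDFS; the sketch below uses a throwaway local directory, and the mode_string helper name is hypothetical.

```shell
# mode_string: print only the ten-character mode string of a directory.
mode_string() {
    ls -ld "$1" | cut -c1-10
}

demo_dir=$(mktemp -d)      # scratch directory on the local file system
chmod 1777 "$demo_dir"
mode_string "$demo_dir"    # prints drwxrwxrwt
rmdir "$demo_dir"
```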

Step 4: Create the MapReduce System Directories

$ sudo -u hdfs hadoop fs -mkdir -p /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
$ sudo -u hdfs hadoop fs -chmod 1777 /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
$ sudo -u hdfs hadoop fs -chown -R mapred /var/lib/hadoop-hdfs/cache/mapred

Step 5: Verify the HDFS File Structure

$ sudo -u hdfs hadoop fs -ls -R /

You should see:

drwxrwxrwt - hdfs supergroup 0 2012-04-19 15:14 /tmp
drwxr-xr-x - hdfs supergroup 0 2012-04-19 15:16 /var
drwxr-xr-x - hdfs supergroup 0 2012-04-19 15:16 /var/lib
drwxr-xr-x - hdfs supergroup 0 2012-04-19 15:16 /var/lib/hadoop-hdfs
drwxr-xr-x - hdfs supergroup 0 2012-04-19 15:16 /var/lib/hadoop-hdfs/cache
drwxr-xr-x - mapred supergroup 0 2012-04-19 15:19 /var/lib/hadoop-hdfs/cache/mapred
drwxr-xr-x - mapred supergroup 0 2012-04-19 15:29 /var/lib/hadoop-hdfs/cache/mapred/mapred
drwxrwxrwt - mapred supergroup 0 2012-04-19 15:33 /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
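
Rather than comparing the listing by eye, the important entries can be checked mechanically. The following sketch reads a listing in the format shown above on standard input; the check_mode helper name is hypothetical, and it assumes that the mode is the first field and the path is the last field of each line.

```shell
# check_mode: succeed if <path> appears in the listing on stdin with <mode>.
# usage: check_mode <path> <mode> < listing
check_mode() {
    awk -v path="$1" -v mode="$2" '
        $NF == path { found = 1; ok = ($1 == mode) }
        END { exit !(found && ok) }
    '
}

# Demonstration on two lines taken from the expected listing.
check_mode /tmp drwxrwxrwt <<'EOF' && echo "/tmp looks right"
drwxrwxrwt - hdfs supergroup 0 2012-04-19 15:14 /tmp
drwxr-xr-x - hdfs supergroup 0 2012-04-19 15:16 /var
EOF
```

Against a running cluster, the same check would be applied to the real listing:

    $ sudo -u hdfs hadoop fs -ls -R / | check_mode /tmp drwxrwxrwt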

Step 6: Start MapReduce

$ for x in `cd /etc/init.d ; ls hadoop-0.20-mapreduce-*` ; do sudo service $x start ; done

To verify that the services have started, check the web console. The JobTracker provides a web console at http://localhost:50030/ for viewing running, completed, and failed jobs, along with their logs.
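
The start-up loops for HDFS and MapReduce both work by listing init scripts whose names begin with a given prefix. The sketch below factors that listing into a helper so it can be reused, for example to query status; the list_services name is hypothetical, and a shell glob is used in place of parsing ls output.

```shell
# list_services: print the names of the init scripts in <init-dir> whose
# names start with <prefix>, one per line.
# usage: list_services <init-dir> <prefix>
list_services() {
    for f in "$1"/"$2"*; do
        [ -e "$f" ] || continue    # the glob matched nothing
        basename "$f"
    done
}

# Prints nothing on a machine where the CDH packages are not installed.
list_services /etc/init.d hadoop-hdfs-
list_services /etc/init.d hadoop-0.20-mapreduce-
```

To check every MapReduce daemon in one pass:

    $ for s in $(list_services /etc/init.d hadoop-0.20-mapreduce-); do sudo service "$s" status; done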

Step 7: Create User Directories

Create a home directory for each MapReduce user. It is best to do this on the NameNode; for example:

$ sudo -u hdfs hadoop fs -mkdir -p /user/<user>
$ sudo -u hdfs hadoop fs -chown <user> /user/<user>

where <user> is the Linux username of each user.

Alternatively, you can log in as each Linux user (or write a script to do so) and create the home directory as follows:

$ sudo -u hdfs hadoop fs -mkdir -p /user/$USER
$ sudo -u hdfs hadoop fs -chown $USER /user/$USER
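
For more than a couple of users, the two commands can be generated in a loop. This is a sketch only: the home_dir_cmds helper name is hypothetical, and alice is a placeholder username. The helper prints the commands instead of running them, so the list can be reviewed before anything changes in HDFS.

```shell
# home_dir_cmds: print the mkdir and chown commands for each username given.
home_dir_cmds() {
    for user in "$@"; do
        echo "sudo -u hdfs hadoop fs -mkdir -p /user/$user"
        echo "sudo -u hdfs hadoop fs -chown $user /user/$user"
    done
}

home_dir_cmds joe alice
```

Once the output looks right, it can be piped to a shell to execute it:

    $ home_dir_cmds joe alice | sh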

Running an example application with MRv1

  1. Create a home directory on HDFS for the user who will be running the job (for example, joe):
    $ sudo -u hdfs hadoop fs -mkdir -p /user/joe
    $ sudo -u hdfs hadoop fs -chown joe /user/joe

    Do the following steps as the user joe.

  2. Make a directory in HDFS called input and copy some XML files into it by running the following commands:
    $ hadoop fs -mkdir input
    $ hadoop fs -put /etc/hadoop/conf/*.xml input
    $ hadoop fs -ls input
    Found 3 items
    -rw-r--r-- 1 joe supergroup 1348 2012-02-13 12:21 input/core-site.xml
    -rw-r--r-- 1 joe supergroup 1913 2012-02-13 12:21 input/hdfs-site.xml
    -rw-r--r-- 1 joe supergroup 1001 2012-02-13 12:21 input/mapred-site.xml
  3. Run an example Hadoop job to grep with a regular expression in your input data.
    $ /usr/bin/hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar grep input output 'dfs[a-z.]+'
  4. After the job completes, you can find the output in the HDFS directory named output because you specified that output directory to Hadoop.
    $ hadoop fs -ls
    Found 2 items
    drwxr-xr-x - joe supergroup 0 2009-08-18 18:36 /user/joe/input
    drwxr-xr-x - joe supergroup 0 2009-08-18 18:38 /user/joe/output

    You can see that there is a new directory called output.

  5. List the output files.
    $ hadoop fs -ls output
    Found 3 items
    drwxr-xr-x - joe supergroup 0 2009-02-25 10:33 /user/joe/output/_logs
    -rw-r--r-- 1 joe supergroup 1068 2009-02-25 10:33 /user/joe/output/part-00000
    -rw-r--r-- 1 joe supergroup 0 2009-02-25 10:33 /user/joe/output/_SUCCESS
  6. Read the results in the output file; for example:
    $ hadoop fs -cat output/part-00000 | head
    1 dfs.datanode.data.dir
    1 dfs.namenode.checkpoint.dir
    1 dfs.namenode.name.dir
    1 dfs.replication
    1 dfs.safemode.extension
    1 dfs.safemode.min.datanodes
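
As a sanity check, roughly the same counts can be computed on the local file system with standard tools. The sketch below applies the same regular expression as the example job; the count_matches helper name is hypothetical, and the sample file is made up for the demonstration.

```shell
# count_matches: count each distinct match of the example's regular
# expression in the given files, most frequent first.
count_matches() {
    cat "$@" | grep -oE 'dfs[a-z.]+' | sort | uniq -c | sort -rn
}

# Demonstration on a small made-up sample file.
sample=$(mktemp)
printf '%s\n' \
    '<name>dfs.replication</name>' \
    '<name>dfs.namenode.name.dir</name>' \
    '<!-- dfs.replication again -->' > "$sample"
count_matches "$sample"
rm -f "$sample"
```

Running count_matches /etc/hadoop/conf/*.xml on the node should give counts that broadly agree with the contents of output/part-00000, although the order of entries with equal counts may differ.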
Page generated September 3, 2015.