Installing CDH 5 with YARN on a Single Linux Node in Pseudo-distributed mode
Before you start, uninstall MRv1 if necessary
If you have already installed MRv1 following the steps in the previous section, you now need to uninstall hadoop-0.20-conf-pseudo before running YARN. Proceed as follows.
- Stop the daemons:
$ for x in `cd /etc/init.d ; ls hadoop-hdfs-*` ; do sudo service $x stop ; done
$ for x in `cd /etc/init.d ; ls hadoop-0.20-mapreduce-*` ; do sudo service $x stop ; done
- Remove hadoop-0.20-conf-pseudo:
- On Red Hat-compatible systems:
$ sudo yum remove hadoop-0.20-conf-pseudo hadoop-0.20-mapreduce-*
- On SLES systems:
$ sudo zypper remove hadoop-0.20-conf-pseudo hadoop-0.20-mapreduce-*
- On Ubuntu or Debian systems:
$ sudo apt-get remove hadoop-0.20-conf-pseudo hadoop-0.20-mapreduce-*
Note: In this case (after uninstalling hadoop-0.20-conf-pseudo) you can skip the package download steps below.
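The stop loops above can be wrapped in a small helper that prints the commands before anything is stopped. This is only a sketch: the function name service_cmds and its three arguments are made up for illustration, and it assumes the init scripts all live in one directory, as they do under /etc/init.d here.

```shell
# Hypothetical helper (not part of CDH): print one "sudo service ..." line
# for every init script in a directory whose name starts with a prefix.
service_cmds() {
  init_dir=$1   # e.g. /etc/init.d
  prefix=$2     # e.g. hadoop-hdfs- or hadoop-0.20-mapreduce-
  action=$3     # e.g. stop
  for path in "$init_dir"/"$prefix"*; do
    # If nothing matches, the glob stays literal; skip it.
    [ -e "$path" ] || continue
    echo "sudo service ${path##*/} $action"
  done
}

# Review the output first, then pipe it to sh to actually run it:
#   service_cmds /etc/init.d hadoop-0.20-mapreduce- stop
#   service_cmds /etc/init.d hadoop-0.20-mapreduce- stop | sh
```

Printing first makes it easy to confirm that only the intended daemons are affected.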
If you have not already done so, install the Oracle Java Development Kit (JDK) before deploying CDH 5. Follow these instructions.
On Red Hat/CentOS/Oracle 5 or Red Hat 6 systems, do the following:
Download the CDH 5 Package
- Click the entry in the table below that matches your Red Hat or CentOS
system, choose Save File, and save the file to
a directory to which you have write access (it can be your home directory).
  OS Version                  Click this Link
  Red Hat/CentOS/Oracle 5     Red Hat/CentOS/Oracle 5 link
  Red Hat/CentOS/Oracle 6     Red Hat/CentOS/Oracle 6 link
- Install the RPM.
For Red Hat/CentOS/Oracle 5:
$ sudo yum --nogpgcheck localinstall cloudera-cdh-5-0.x86_64.rpm
For Red Hat/CentOS/Oracle 6 (64-bit):
$ sudo yum --nogpgcheck localinstall cloudera-cdh-5-0.x86_64.rpm
Note: For instructions on how to add a CDH 5 yum repository or build your own CDH 5 yum repository, see Installing CDH 5 On Red Hat-compatible systems.
Install CDH 5
- (Optionally) add a repository key. Add the Cloudera Public GPG Key to your repository by executing the following command:
- For Red Hat/CentOS/Oracle 5 systems:
$ sudo rpm --import http://archive.cloudera.com/cdh5/redhat/5/x86_64/cdh/RPM-GPG-KEY-cloudera
- For Red Hat/CentOS/Oracle 6 systems:
$ sudo rpm --import http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera
- Install Hadoop in pseudo-distributed mode: To install Hadoop with YARN:
$ sudo yum install hadoop-conf-pseudo
On SLES systems, do the following:
Download and install the CDH 5 package
- Download the CDH 5 "1-click Install" package.
Click this link, choose Save File, and save it to a directory to which you have write access (it can be your home directory).
- Install the RPM:
$ sudo rpm -i cloudera-cdh-5-0.x86_64.rpm
Note: For instructions on how to add a CDH 5 SLES repository or build your own CDH 5 SLES repository, see Installing CDH 5 On SLES systems.
Install CDH 5
- (Optionally) add a repository key. Add the Cloudera Public GPG Key to your repository by executing the following command:
- For all SLES systems:
$ sudo rpm --import http://archive.cloudera.com/cdh5/sles/11/x86_64/cdh/RPM-GPG-KEY-cloudera
- Install Hadoop in pseudo-distributed mode: To install Hadoop with YARN:
$ sudo zypper install hadoop-conf-pseudo
On Ubuntu and other Debian systems, do the following:
Download and install the package
- Download the CDH 5 "1-click Install" package:
  OS Version    Click this Link
  Wheezy        Wheezy link
  Precise       Precise link
- Install the package. Do one of the following:
- Choose Open with in the download window to use the package manager.
- Choose Save File, save the package to a directory to which you have write access (it can be your home directory) and install it from the command line, for example:
$ sudo dpkg -i cdh5-repository_1.0_all.deb
For instructions on how to add a CDH 5 Debian repository or build your own CDH 5 Debian repository, see Installing CDH 5 On Ubuntu or Debian systems.
Install CDH 5
- (Optionally) add a repository key. Add the Cloudera Public GPG Key to your repository by executing the following command:
- For Ubuntu Lucid systems:
$ curl -s http://archive.cloudera.com/cdh5/ubuntu/lucid/amd64/cdh/archive.key | sudo apt-key add -
- For Ubuntu Precise systems:
$ curl -s http://archive.cloudera.com/cdh5/ubuntu/precise/amd64/cdh/archive.key | sudo apt-key add -
- For Debian Squeeze systems:
$ curl -s http://archive.cloudera.com/cdh5/debian/squeeze/amd64/cdh/archive.key | sudo apt-key add -
- Install Hadoop in pseudo-distributed mode: To install Hadoop with YARN:
$ sudo apt-get update
$ sudo apt-get install hadoop-conf-pseudo
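The three platform sections end with the same package under a different package manager. As a compact summary, the sketch below maps an OS family to the install command used in this guide. The function name pseudo_install_cmd and the family labels redhat, sles and debian are assumptions made for this example; they are not values detected from the system.

```shell
# Hypothetical helper: print the install command this guide uses for a
# given OS family. The family names are labels chosen for this sketch.
pseudo_install_cmd() {
  case $1 in
    redhat) echo "sudo yum install hadoop-conf-pseudo" ;;
    sles)   echo "sudo zypper install hadoop-conf-pseudo" ;;
    debian) echo "sudo apt-get update && sudo apt-get install hadoop-conf-pseudo" ;;
    *)      echo "unknown OS family: $1" >&2; return 1 ;;
  esac
}

# Example:
#   pseudo_install_cmd sles
```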
Starting Hadoop and Verifying it is Working Properly
For YARN, a pseudo-distributed Hadoop installation consists of one node running all five Hadoop daemons: namenode, secondarynamenode, resourcemanager, datanode, and nodemanager.
- To view the files installed by the hadoop-conf-pseudo package on Red Hat or SLES systems:
$ rpm -ql hadoop-conf-pseudo
- To view the files installed by the hadoop-conf-pseudo package on Ubuntu or Debian systems:
$ dpkg -L hadoop-conf-pseudo
The new configuration is self-contained in the /etc/hadoop/conf.pseudo directory.
The Cloudera packages use the alternatives framework to manage which Hadoop configuration is active. All Hadoop components search for the Hadoop configuration in /etc/hadoop/conf.
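Under the alternatives framework, /etc/hadoop/conf is typically a chain of symbolic links that ends at the active configuration directory. A minimal sketch for checking where it points, assuming GNU readlink (for the -f flag); the function name active_conf_dir is made up for this example.

```shell
# Hypothetical helper: resolve the directory a configuration path
# finally points to, following every symlink along the way.
# Defaults to /etc/hadoop/conf when no argument is given.
active_conf_dir() {
  readlink -f "${1:-/etc/hadoop/conf}"
}

# Example, on a node set up as described above:
#   active_conf_dir
```

If the pseudo-distributed configuration is the active one, the result should be the /etc/hadoop/conf.pseudo directory mentioned above.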
To start Hadoop, proceed as follows.
Step 1: Format the NameNode.
Before starting the NameNode for the first time you must format the file system.
$ sudo -u hdfs hdfs namenode -format
Make sure you perform the format of the NameNode as user hdfs. You can do this as part of the command string, using sudo -u hdfs as in the command above.
In earlier releases, the hadoop-conf-pseudo package automatically formatted HDFS on installation. In CDH 5, you must do this explicitly.
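Because formatting discards any existing HDFS metadata, it can be worth checking whether a NameNode directory has already been formatted before running the command again. The sketch below relies on the fact that a formatted NameNode storage directory contains a current/VERSION file. The function name namenode_is_formatted is hypothetical, and the directory must be supplied by you (it is the value of dfs.namenode.name.dir in your configuration) and be readable by the user running the check.

```shell
# Hypothetical helper: succeed if a NameNode storage directory already
# holds metadata. A formatted directory contains current/VERSION.
namenode_is_formatted() {
  [ -f "$1/current/VERSION" ]
}

# Example (replace the path with your dfs.namenode.name.dir value):
#   if namenode_is_formatted /path/to/dfs/name; then
#     echo "already formatted, not formatting again"
#   else
#     sudo -u hdfs hdfs namenode -format
#   fi
```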
Step 2: Start HDFS
$ for x in `cd /etc/init.d ; ls hadoop-hdfs-*` ; do sudo service $x start ; done
To verify services have started, you can check the web console. The NameNode provides a web console http://localhost:50070/ for viewing your Distributed File System (DFS) capacity, number of DataNodes, and logs. In this pseudo-distributed configuration, you should see one live DataNode named localhost.
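The web console may not answer immediately after the services start. The following sketch polls until a command succeeds instead of checking once. The retry function is a generic helper written for this example, not something shipped with CDH, and the curl invocation in the comment assumes curl is installed.

```shell
# Hypothetical helper: run a command until it succeeds, up to a maximum
# number of attempts, sleeping between attempts.
retry() {
  attempts=$1; delay=$2; shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    i=$((i + 1))
    # Sleep only if another attempt is still to come.
    [ "$i" -le "$attempts" ] && sleep "$delay"
  done
  return 1
}

# Example: try the NameNode web console up to 10 times, 3 seconds apart.
#   retry 10 3 curl -sf -o /dev/null http://localhost:50070/
```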
Step 3: Create the /tmp, Staging and Log Directories
- Remove the old /tmp if it exists:
$ sudo -u hdfs hadoop fs -rm -r /tmp
- Create the new directories and set permissions:
$ sudo -u hdfs hadoop fs -mkdir -p /tmp/hadoop-yarn/staging/history/done_intermediate
$ sudo -u hdfs hadoop fs -chown -R mapred:mapred /tmp/hadoop-yarn/staging
$ sudo -u hdfs hadoop fs -chmod -R 1777 /tmp
$ sudo -u hdfs hadoop fs -mkdir -p /var/log/hadoop-yarn
$ sudo -u hdfs hadoop fs -chown yarn:mapred /var/log/hadoop-yarn
Note: You need to create /var/log/hadoop-yarn because it is the parent of /var/log/hadoop-yarn/apps, which is explicitly configured in yarn-site.xml.
Step 4: Verify the HDFS File Structure:
Run the following command:
$ sudo -u hdfs hadoop fs -ls -R /
You should see the following directory structure:
drwxrwxrwt - hdfs supergroup 0 2012-05-31 15:31 /tmp
drwxr-xr-x - hdfs supergroup 0 2012-05-31 15:31 /tmp/hadoop-yarn
drwxrwxrwt - mapred mapred 0 2012-05-31 15:31 /tmp/hadoop-yarn/staging
drwxr-xr-x - mapred mapred 0 2012-05-31 15:31 /tmp/hadoop-yarn/staging/history
drwxrwxrwt - mapred mapred 0 2012-05-31 15:31 /tmp/hadoop-yarn/staging/history/done_intermediate
drwxr-xr-x - hdfs supergroup 0 2012-05-31 15:31 /var
drwxr-xr-x - hdfs supergroup 0 2012-05-31 15:31 /var/log
drwxr-xr-x - yarn mapred 0 2012-05-31 15:31 /var/log/hadoop-yarn
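Comparing the listing by eye is error-prone, so the check can be scripted. The sketch below reads a recursive listing on standard input and tests one path for the expected mode, owner and group. The function name has_entry is made up for this example, and it assumes the eight-column layout shown in the listing above and paths without spaces.

```shell
# Hypothetical helper: read "hadoop fs -ls -R" output on stdin and succeed
# only if the given path is listed with the expected mode, owner and group.
# Listing columns: mode, replication, owner, group, size, date, time, path.
has_entry() {
  awk -v path="$1" -v mode="$2" -v owner="$3" -v group="$4" '
    $8 == path { found = ($1 == mode && $3 == owner && $4 == group) }
    END { exit(found ? 0 : 1) }
  '
}

# Example:
#   sudo -u hdfs hadoop fs -ls -R / > /tmp/hdfs-listing.txt
#   has_entry /tmp drwxrwxrwt hdfs supergroup < /tmp/hdfs-listing.txt
#   has_entry /var/log/hadoop-yarn drwxr-xr-x yarn mapred < /tmp/hdfs-listing.txt
```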
Step 5: Start YARN
$ sudo service hadoop-yarn-resourcemanager start
$ sudo service hadoop-yarn-nodemanager start
$ sudo service hadoop-mapreduce-historyserver start
Step 6: Create User Directories
Create a home directory for each MapReduce user. It is best to do this on the NameNode; for example:
$ sudo -u hdfs hadoop fs -mkdir /user/<user>
$ sudo -u hdfs hadoop fs -chown <user> /user/<user>
where <user> is the Linux username of each user.
Alternatively, you can log in as each Linux user (or write a script to do so) and create the home directory as follows:
$ sudo -u hdfs hadoop fs -mkdir /user/$USER
$ sudo -u hdfs hadoop fs -chown $USER /user/$USER
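For more than a handful of users, the two commands can be looped over a list of names. This is a sketch under stated assumptions: the function name make_home_dirs and the HADOOP_FS variable are invented for this example. HADOOP_FS holds the command prefix, defaults to the sudo -u hdfs hadoop fs form used above, and can be set to echo for a dry run.

```shell
# Hypothetical helper: create and chown an HDFS home directory for each
# user name given. HADOOP_FS is a variable made up for this sketch; it is
# deliberately left unquoted so the default splits into separate words.
make_home_dirs() {
  for user in "$@"; do
    ${HADOOP_FS:-sudo -u hdfs hadoop fs} -mkdir "/user/$user" &&
    ${HADOOP_FS:-sudo -u hdfs hadoop fs} -chown "$user" "/user/$user" ||
    return 1
  done
}

# Dry run, printing the arguments instead of calling hadoop:
#   HADOOP_FS=echo make_home_dirs joe alice
# Real run:
#   make_home_dirs joe alice
```

The function stops at the first failure so that a later chown is never attempted on a directory that could not be created.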
Running an example application with YARN
- Create a home directory on HDFS for the user who will be running the job
(for example, joe):
$ sudo -u hdfs hadoop fs -mkdir /user/joe
$ sudo -u hdfs hadoop fs -chown joe /user/joe
Do the following steps as the user joe.
- Make a directory in HDFS called input and copy some XML files into it by running the following commands in pseudo-distributed mode:
$ hadoop fs -mkdir input
$ hadoop fs -put /etc/hadoop/conf/*.xml input
$ hadoop fs -ls input
Found 3 items:
-rw-r--r-- 1 joe supergroup 1348 2012-02-13 12:21 input/core-site.xml
-rw-r--r-- 1 joe supergroup 1913 2012-02-13 12:21 input/hdfs-site.xml
-rw-r--r-- 1 joe supergroup 1001 2012-02-13 12:21 input/mapred-site.xml
- Set HADOOP_MAPRED_HOME for user joe:
$ export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce
- Run an example Hadoop job to grep with a regular expression in your input data.
$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar grep input output23 'dfs[a-z.]+'
- After the job completes, you can find the output in the HDFS directory named output23 because you specified that output directory to Hadoop.
$ hadoop fs -ls
Found 2 items
drwxr-xr-x - joe supergroup 0 2009-08-18 18:36 /user/joe/input
drwxr-xr-x - joe supergroup 0 2009-08-18 18:38 /user/joe/output23
You can see that there is a new directory called output23.
- List the output files.
$ hadoop fs -ls output23
Found 2 items
drwxr-xr-x - joe supergroup 0 2009-02-25 10:33 /user/joe/output23/_SUCCESS
-rw-r--r-- 1 joe supergroup 1068 2009-02-25 10:33 /user/joe/output23/part-r-00000
- Read the results in the output file.
$ hadoop fs -cat output23/part-r-00000 | head
1 dfs.safemode.min.datanodes
1 dfs.safemode.extension
1 dfs.replication
1 dfs.permissions.enabled
1 dfs.namenode.name.dir
1 dfs.namenode.checkpoint.dir
1 dfs.datanode.data.dir
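To see what the example job computes, it can help to reproduce it locally on the same files. The grep example counts every distinct string that matches the regular expression and lists the counts in decreasing order. The pipeline below is a rough local approximation using standard Unix tools, not the job's actual implementation. The function name local_grep_count is made up for this sketch, and it assumes a simple pattern such as 'dfs[a-z.]+' that means the same thing to grep -E as it does to the Hadoop job.

```shell
# Hypothetical helper: count each distinct match of an extended regular
# expression across local files, most frequent first.
local_grep_count() {
  pattern=$1; shift
  # -o prints only the matched text, -h omits file names, -E enables ERE.
  grep -ohE "$pattern" "$@" | sort | uniq -c | sort -rn
}

# Example, run against the same files that were copied into HDFS:
#   local_grep_count 'dfs[a-z.]+' /etc/hadoop/conf/*.xml
```

The counts should correspond to the first column of part-r-00000, although the spacing differs because uniq pads the count with leading blanks.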