This is the documentation for CDH 5.1.x. Documentation for other versions is available at Cloudera Documentation.

Upgrading to CDH 5

Note: Are you on the right page?

Use the instructions on this page only to upgrade from CDH 4.

To upgrade from an earlier CDH 5 release to the latest version:

Use these instructions to upgrade from a CDH 5 Beta release;
Use these instructions to upgrade from CDH 5.0.0 or later.

Important:

To upgrade from CDH 4, you must uninstall CDH 4, and then install CDH 5. Make sure you allow sufficient time for this, and do the necessary backup and preparation as described below.
If you have configured HDFS HA with NFS shared storage, do not proceed. This configuration is not supported on CDH 5; Quorum-based storage is the only supported HDFS HA configuration on CDH 5. Unconfigure your NFS shared storage configuration before you attempt to upgrade.

Note: Running Services

When starting, stopping, and restarting CDH components, always use the service(8) command rather than running scripts in /etc/init.d directly. This is important because service sets the current working directory to / and removes most environment variables (passing only LANG and TERM) so as to create a predictable environment in which to administer the service. If you run the scripts in /etc/init.d, any environment variables you have set remain in force, and could produce unpredictable results. (If you install CDH from packages, service will be installed as part of the Linux Standard Base (LSB).)

To upgrade to the latest CDH 5 release, perform the following steps.

Back Up Configuration Data and Stop Services
Back up the HDFS Metadata
Update Alternatives
Uninstall the CDH 4 Version of Hadoop
Download the Latest Version of CDH 5
Install CDH 5 with YARN
Install CDH 5 with MRv1
Copy the CDH 5 Logging File
In an HA Deployment, Upgrade and Start the Journal Nodes
Upgrade the HDFS Metadata
Start YARN or MapReduce MRv1
Set the Sticky Bit
Re-Install CDH 5 Components
Apply Configuration File Changes
Finalize the HDFS Metadata Upgrade

Back Up Configuration Data and Stop Services

Put the NameNode into safe mode and save the fsimage:
1. Put the NameNode (or active NameNode in an HA configuration) into safe mode:
```
$ sudo -u hdfs hdfs dfsadmin -safemode enter
```
2. Perform a saveNamespace operation:
```
$ sudo -u hdfs hdfs dfsadmin -saveNamespace 
```
  This will result in a new fsimage being written out with no edit log entries.
3. With the NameNode still in safe mode, shut down all services as instructed below.
For each component you are using, back up configuration data, databases, and other important files.

Shut down the Hadoop services across your entire cluster:

for x in `cd /etc/init.d ; ls hadoop-*` ; do sudo service $x stop ; done

As root, check each host to make sure that no processes are still running as the hdfs or mapred users:
```
# ps -aef | grep java
```
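The check above lists every Java process, so it can help to narrow the listing to the two service accounts. The following is a minimal sketch rather than part of the documented procedure: the function name leftover_hadoop_procs is hypothetical, and the usage line assumes a ps that accepts -eo user=,pid=,args= to print a header-less listing.

```shell
# Hypothetical helper: reads a header-less process listing (user, pid, command)
# on stdin and prints only the lines owned by the hdfs or mapred users.
leftover_hadoop_procs() {
  awk '$1 == "hdfs" || $1 == "mapred"'
}

# Intended use, as root on each host (prints nothing when no such process remains):
#   ps -eo user=,pid=,args= | leftover_hadoop_procs
```

Keeping the filter separate from the ps call makes it easy to try against saved output before running it on a live host.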

Back up the HDFS Metadata

Important:

Do this step when you are sure that all Hadoop services have been shut down. It is particularly important that the NameNode service is not running so that you can make a consistent backup.

To back up the HDFS metadata on the NameNode machine:

Note:

Cloudera recommends backing up HDFS metadata on a regular basis, as well as before a major upgrade.
dfs.name.dir is deprecated but still works; dfs.namenode.name.dir is preferred. This example uses dfs.name.dir.

Find the location of your dfs.name.dir (or dfs.namenode.name.dir); for example:

$ grep -C1 dfs.name.dir /etc/hadoop/conf/hdfs-site.xml

You should see something like this:

<property>
<name>dfs.name.dir</name>
<value>/mnt/hadoop/hdfs/name</value>

Back up the directory. The path inside the <value> XML element is the path to your HDFS metadata. If you see a comma-separated list of paths, there is no need to back up all of them; they store the same data. Back up the first directory, for example, by using the following commands:
```
$ cd /mnt/hadoop/hdfs/name
# tar -cvf /root/nn_backup_data.tar .
./
./current/
./current/fsimage
./current/fstime
./current/VERSION
./current/edits
./image/
./image/fsimage
```
Warning:
If you see a file containing the word lock, the NameNode is probably still running. Repeat the preceding steps, starting by shutting down the Hadoop services.
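The backup and the lock-file check described above can be combined in a small script. This is a sketch under stated assumptions, not the documented procedure: both function names are hypothetical, and the check assumes that "a file containing the word lock" means a file whose name contains lock.

```shell
# Hypothetical helpers; the names and the lock-file pattern are assumptions.

# Print the first entry of a comma-separated dfs.name.dir value.
first_name_dir() {
  printf '%s\n' "${1%%,*}"
}

# Archive a metadata directory, refusing to run if a lock file is present.
# $1: metadata directory, $2: destination tar file (must be outside $1)
backup_name_dir() {
  src=$1
  dest=$2
  if [ -n "$(find "$src" -name '*lock*' -print)" ]; then
    echo "lock file found under $src; the NameNode may still be running" >&2
    return 1
  fi
  # Create the archive, then list it to confirm that it is readable.
  tar -cf "$dest" -C "$src" . && tar -tf "$dest" > /dev/null
}

# Example, as root on the NameNode host (the value shown is illustrative):
#   backup_name_dir "$(first_name_dir /mnt/hadoop/hdfs/name)" /root/nn_backup_data.tar
```

Listing the archive after creating it is a cheap way to catch a truncated or unreadable backup before CDH 4 is uninstalled.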

Update Alternatives

On each node in the cluster:

Update the alternatives, for example:

$ sudo update-alternatives --install /etc/hadoop/conf hadoop-conf /etc/hadoop/conf.my_cluster 50

Verify that the operation succeeded:

$ sudo update-alternatives --display hadoop-conf

Uninstall the CDH 4 Version of Hadoop

Warning:

Do not proceed before you have backed up the HDFS metadata, and the files and databases for the individual components, as instructed in the previous steps.

To uninstall Hadoop:

Run this command on each host:

On Red Hat-compatible systems:

$ sudo yum remove bigtop-utils bigtop-jsvc bigtop-tomcat sqoop2-client hue-common solr

On SLES systems:

$ sudo zypper remove bigtop-utils bigtop-jsvc bigtop-tomcat sqoop2-client hue-common solr

On Ubuntu systems:

sudo apt-get remove bigtop-utils bigtop-jsvc bigtop-tomcat sqoop2-client hue-common solr

Remove CDH 4 Repository Files

Remove all Cloudera CDH 4 repository files. For example, on a Red Hat or similar system, remove all files in /etc/yum.repos.d that have cloudera as part of the name.

Important:

Before removing the files, make sure you have not added any custom entries that you want to preserve. (To preserve custom entries, back up the files before removing them.)
Make sure you remove Impala and Search repository files, as well as the CDH repository file.
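Backing up and removing the repository files, as described above, might be scripted along the following lines. This is only a sketch: the function name remove_cloudera_repos and the backup location in the example are hypothetical, and the name match on cloudera follows the Red Hat example in the text.

```shell
# Hypothetical helper: copy every file whose name contains "cloudera" from a
# repository directory into a backup directory, then remove the original.
# $1: repository directory (for example /etc/yum.repos.d), $2: backup directory
remove_cloudera_repos() {
  repo_dir=$1
  backup_dir=$2
  mkdir -p "$backup_dir" || return 1
  for f in "$repo_dir"/*cloudera*; do
    [ -f "$f" ] || continue          # skip when the glob matches nothing
    cp -p "$f" "$backup_dir"/ || return 1
    rm -f "$f"
    echo "removed $f"
  done
}

# Example, as root (the backup path is illustrative):
#   remove_cloudera_repos /etc/yum.repos.d /root/cdh4-repo-backup
```

Copying before removing preserves any custom entries, and a file is only deleted once its copy has succeeded.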

Download the Latest Version of CDH 5

Note:

For instructions on how to add a CDH 5 yum repository or build your own CDH 5 yum repository, see Installing CDH 5.

On Red Hat-compatible systems:

Download the CDH 5 "1-click Install" package.

Click the entry in the table below that matches your Red Hat or CentOS system, choose Save File, and save the file to a directory to which you have write access (it can be your home directory).

OS Version	Click this Link
Red Hat/CentOS/Oracle 5	Red Hat/CentOS/Oracle 5 link
Red Hat/CentOS/Oracle 6	Red Hat/CentOS/Oracle 6 link

Install the RPM:

Red Hat/CentOS/Oracle 5

$ sudo yum --nogpgcheck localinstall cloudera-cdh-5-0.x86_64.rpm

Red Hat/CentOS/Oracle 6

$ sudo yum --nogpgcheck localinstall cloudera-cdh-5-0.x86_64.rpm

Note: Make sure your repositories are up to date

Before proceeding, make sure the repositories on each system are up to date:

sudo yum clean all

This ensures that the system repositories contain the latest software (it does not actually install anything).

Now (optionally) add a repository key:

For Red Hat/CentOS/Oracle 5 systems:

$ sudo rpm --import http://archive.cloudera.com/cdh5/redhat/5/x86_64/cdh/RPM-GPG-KEY-cloudera

For Red Hat/CentOS/Oracle 6 systems:

$ sudo rpm --import http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera

On SLES systems:

Download the CDH 5 "1-click Install" package.
Click this link, choose Save File, and save it to a directory to which you have write access (it can be your home directory).

Install the RPM:

$ sudo rpm -i cloudera-cdh-5-0.x86_64.rpm

Update your system package index by running:
```
$ sudo zypper refresh
```

Note: Make sure your repositories are up to date

Before proceeding, make sure the repositories on each system are up to date:

sudo zypper clean --all

This ensures that the system repositories contain the latest software (it does not actually install anything).

Now (optionally) add a repository key:

$ sudo rpm --import http://archive.cloudera.com/cdh5/sles/11/x86_64/cdh/RPM-GPG-KEY-cloudera

On Ubuntu and Debian systems:

Download the CDH 5 "1-click Install" package:

OS Version	Click this Link
Wheezy	Wheezy link
Precise	Precise link

Install the package. Do one of the following:
- Choose Open with in the download window to use the package manager.
- Choose Save File, save the package to a directory to which you have write access (it can be your home directory) and install it from the command line, for example:
```
sudo dpkg -i cdh5-repository_1.0_all.deb
```

Note: Make sure your repositories are up to date

Before proceeding, make sure the repositories on each system are up to date:

sudo apt-get update

This ensures that the system repositories contain the latest software (it does not actually install anything).

Now (optionally) add a repository key:

For Ubuntu Precise systems:

$ curl -s http://archive.cloudera.com/cdh5/ubuntu/precise/amd64/cdh/archive.key | sudo apt-key add -

For Debian Wheezy systems:

$ curl -s http://archive.cloudera.com/cdh5/debian/wheezy/amd64/cdh/archive.key | sudo apt-key add -

Install CDH 5 with YARN

Note:

Skip this step and go to Install CDH 5 with MRv1 if you intend to use only MRv1.

Install and deploy ZooKeeper.
Important:
Cloudera recommends that you install (or update) and start a ZooKeeper cluster before proceeding. This is a requirement if you are deploying high availability (HA) for the NameNode or JobTracker.

Follow instructions under ZooKeeper Installation.

Install each type of daemon package on the appropriate system(s), as follows.

Where to install	Install commands
Resource Manager host (analogous to MRv1 JobTracker) running:
Red Hat/CentOS compatible	sudo yum clean all; sudo yum install hadoop-yarn-resourcemanager
SLES	sudo zypper clean --all; sudo zypper install hadoop-yarn-resourcemanager
Ubuntu or Debian	sudo apt-get update; sudo apt-get install hadoop-yarn-resourcemanager
NameNode host(s) running:
Red Hat/CentOS compatible	sudo yum clean all; sudo yum install hadoop-hdfs-namenode
SLES	sudo zypper clean --all; sudo zypper install hadoop-hdfs-namenode
Ubuntu or Debian	sudo apt-get update; sudo apt-get install hadoop-hdfs-namenode
Secondary NameNode host (if used) running:
Red Hat/CentOS compatible	sudo yum clean all; sudo yum install hadoop-hdfs-secondarynamenode
SLES	sudo zypper clean --all; sudo zypper install hadoop-hdfs-secondarynamenode
Ubuntu or Debian	sudo apt-get update; sudo apt-get install hadoop-hdfs-secondarynamenode
All cluster hosts except the Resource Manager running:
Red Hat/CentOS compatible	sudo yum clean all; sudo yum install hadoop-yarn-nodemanager hadoop-hdfs-datanode hadoop-mapreduce
SLES	sudo zypper clean --all; sudo zypper install hadoop-yarn-nodemanager hadoop-hdfs-datanode hadoop-mapreduce
Ubuntu or Debian	sudo apt-get update; sudo apt-get install hadoop-yarn-nodemanager hadoop-hdfs-datanode hadoop-mapreduce
One host in the cluster running:
Red Hat/CentOS compatible	sudo yum clean all; sudo yum install hadoop-mapreduce-historyserver hadoop-yarn-proxyserver
SLES	sudo zypper clean --all; sudo zypper install hadoop-mapreduce-historyserver hadoop-yarn-proxyserver
Ubuntu or Debian	sudo apt-get update; sudo apt-get install hadoop-mapreduce-historyserver hadoop-yarn-proxyserver
All client hosts, running:
Red Hat/CentOS compatible	sudo yum clean all; sudo yum install hadoop-client
SLES	sudo zypper clean --all; sudo zypper install hadoop-client
Ubuntu or Debian	sudo apt-get update; sudo apt-get install hadoop-client
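When the same installation is repeated on many hosts, the package lists in the table above can be kept in one place. The sketch below assumes nothing beyond the table itself; the function name and the role names (resourcemanager, worker, and so on) are hypothetical labels chosen for illustration.

```shell
# Hypothetical role names mapped to the YARN package lists in the table above.
yarn_packages_for_role() {
  case $1 in
    resourcemanager)   echo "hadoop-yarn-resourcemanager" ;;
    namenode)          echo "hadoop-hdfs-namenode" ;;
    secondarynamenode) echo "hadoop-hdfs-secondarynamenode" ;;
    worker)            echo "hadoop-yarn-nodemanager hadoop-hdfs-datanode hadoop-mapreduce" ;;
    historyserver)     echo "hadoop-mapreduce-historyserver hadoop-yarn-proxyserver" ;;
    client)            echo "hadoop-client" ;;
    *)                 echo "unknown role: $1" >&2; return 1 ;;
  esac
}

# Example on a Red Hat-compatible worker host:
#   sudo yum clean all; sudo yum install $(yarn_packages_for_role worker)
```

An unknown role returns a non-zero status, so a wrapper script can stop before calling the package manager with an empty list.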

Note:

The hadoop-yarn and hadoop-hdfs packages are installed on each system automatically as dependencies of the other packages.

If you are installing Llama, make sure that hadoop.proxyuser.llama.hosts and hadoop.proxyuser.llama.groups are configured in your core-site.xml as follows:

  <property>
    <name>hadoop.proxyuser.llama.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.llama.groups</name>
    <value>*</value>
  </property>

Install CDH 5 with MRv1

Note:

Skip this step if you intend to use only YARN. If you are installing both YARN and MRv1, you can skip any packages you have already installed in Install CDH 5 with YARN.

To install CDH 5 with MRv1:

Install and deploy ZooKeeper.
Important:
Cloudera recommends that you install (or update) and start a ZooKeeper cluster before proceeding. This is a requirement if you are deploying high availability (HA) for the NameNode or JobTracker.

Follow instructions under ZooKeeper Installation.

Install each type of daemon package on the appropriate system(s), as follows.

Where to install	Install commands
JobTracker host running:
Red Hat/CentOS compatible	`sudo yum clean all; sudo yum install hadoop-0.20-mapreduce-jobtracker`
SLES	`sudo zypper clean --all; sudo zypper install hadoop-0.20-mapreduce-jobtracker`
Ubuntu or Debian	`sudo apt-get update; sudo apt-get install hadoop-0.20-mapreduce-jobtracker`
NameNode host(s) running:
Red Hat/CentOS compatible	`sudo yum clean all; sudo yum install hadoop-hdfs-namenode`
SLES	`sudo zypper clean --all; sudo zypper install hadoop-hdfs-namenode`
Ubuntu or Debian	`sudo apt-get update; sudo apt-get install hadoop-hdfs-namenode`
Secondary NameNode host (if used) running:
Red Hat/CentOS compatible	`sudo yum clean all; sudo yum install hadoop-hdfs-secondarynamenode`
SLES	`sudo zypper clean --all; sudo zypper install hadoop-hdfs-secondarynamenode`
Ubuntu or Debian	`sudo apt-get update; sudo apt-get install hadoop-hdfs-secondarynamenode`
All cluster hosts except the JobTracker, NameNode, and Secondary (or Standby) NameNode hosts, running:
Red Hat/CentOS compatible	`sudo yum clean all; sudo yum install hadoop-0.20-mapreduce-tasktracker hadoop-hdfs-datanode`
SLES	`sudo zypper clean --all; sudo zypper install hadoop-0.20-mapreduce-tasktracker hadoop-hdfs-datanode`
Ubuntu or Debian	`sudo apt-get update; sudo apt-get install hadoop-0.20-mapreduce-tasktracker hadoop-hdfs-datanode`
All client hosts, running:
Red Hat/CentOS compatible	`sudo yum clean all; sudo yum install hadoop-client`
SLES	`sudo zypper clean --all; sudo zypper install hadoop-client`
Ubuntu or Debian	`sudo apt-get update; sudo apt-get install hadoop-client`

Copy the CDH 5 Logging File

Copy over the log4j.properties file to your custom directory on each node in the cluster; for example:

$ cp /etc/hadoop/conf.empty/log4j.properties /etc/hadoop/conf.my_cluster/log4j.properties

In an HA Deployment, Upgrade and Start the Journal Nodes

Install the JournalNode daemons on each of the machines where they will run.
To install JournalNode on Red Hat-compatible systems:
```
$ sudo yum install hadoop-hdfs-journalnode
```
To install JournalNode on Ubuntu and Debian systems:
```
$ sudo apt-get install hadoop-hdfs-journalnode 
```
To install JournalNode on SLES systems:
```
$ sudo zypper install hadoop-hdfs-journalnode
```
Start the JournalNode daemons on each of the machines where they will run:
```
sudo service hadoop-hdfs-journalnode start 
```

Wait for the daemons to start before proceeding to the next step.

Important:

The JournalNodes must be up and running CDH 5 before you proceed.

Upgrade the HDFS Metadata

Note:

What you do in this step differs depending on whether you are upgrading an HDFS HA deployment using Quorum-based storage, or a non-HA deployment using a secondary NameNode. (If you have an HDFS HA deployment using NFS storage, do not proceed; you cannot upgrade that configuration to CDH 5. Unconfigure your NFS shared storage configuration before you attempt to upgrade.)

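For an HA deployment: upgrade the metadata on the active NameNode, bootstrap and start the standby NameNode, and then start the DataNodes.
For a non-HA deployment: upgrade the metadata on the NameNode, start the DataNodes, and then start the Secondary NameNode.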

To upgrade the HDFS metadata, run the following command on the NameNode. If HA is enabled, do this on the active NameNode only, and make sure the JournalNodes have been upgraded to CDH 5 and are up and running before you run the command.
```
$ sudo service hadoop-hdfs-namenode upgrade
```
Important:
In an HDFS HA deployment, it is critically important that you do this on only one NameNode.

You can watch the progress of the upgrade by running:
```
$ sudo tail -f /var/log/hadoop-hdfs/hadoop-hdfs-namenode-<hostname>.log 
```
Look for a line that confirms the upgrade is complete, such as: /var/lib/hadoop-hdfs/cache/hadoop/dfs/<name> is complete
Note:
The NameNode upgrade process can take a while depending on how many files you have.
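Instead of watching the log by eye, the completion line mentioned above can be polled for. This is a minimal sketch: the function name upgrade_complete is hypothetical, and the match string is an assumption taken from the sample line above, so a log that already contains an "is complete" line from an earlier upgrade would match immediately.

```shell
# Hypothetical helper: succeed once the given NameNode log file contains a
# line with the text "is complete" (assumed from the sample line in the text).
# $1: path to the NameNode log file
upgrade_complete() {
  grep -q 'is complete' "$1"
}

# Example, run as a user who can read the log; pass the log path shown above:
#   until upgrade_complete "$log"; do sleep 10; done
```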
Do this step only in an HA configuration. Otherwise skip to starting up the DataNodes.
Wait for NameNode to exit safe mode, and then re-start the standby NameNode.
- If Kerberos is enabled:
```
$ kinit -kt /path/to/hdfs.keytab hdfs/<fully.qualified.domain.name@YOUR-REALM.COM> && hdfs namenode -bootstrapStandby
```
```
$ sudo service hadoop-hdfs-namenode start
```
- If Kerberos is not enabled:
```
$ sudo -u hdfs hdfs namenode -bootstrapStandby
$ sudo service hadoop-hdfs-namenode start
```
For more information about the haadmin -failover command, see Administering an HDFS High Availability Cluster.

Start up the DataNodes:

On each DataNode:

$ sudo service hadoop-hdfs-datanode start

Do this step only in a non-HA configuration. Otherwise skip to starting YARN or MRv1.
Wait for NameNode to exit safe mode, and then start the Secondary NameNode.
1. To check that the NameNode has exited safe mode, look for messages in the log file, or the NameNode's web interface, that say "...no longer in safe mode."
2. To start the Secondary NameNode (if used), enter the following command on the Secondary NameNode host:
```
$ sudo service hadoop-hdfs-secondarynamenode start 
```
3. To complete the cluster upgrade, follow the remaining steps below.
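Waiting for the NameNode to leave safe mode, as the steps above require, can also be scripted rather than checked in the log or web interface. The sketch below is one possible approach, not the documented one: the function name wait_for_safemode_off is hypothetical, and it assumes that the status command (hdfs dfsadmin -safemode get in the example) prints a line containing "Safe mode is OFF".

```shell
# Hypothetical helper: poll a status command until it reports that safe mode
# is off, or give up after a fixed number of attempts.
# $1: maximum attempts, $2: seconds between attempts, remaining args: status command
wait_for_safemode_off() {
  attempts=$1
  delay=$2
  shift 2
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if "$@" 2>/dev/null | grep -q 'Safe mode is OFF'; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example on the NameNode host (up to 60 checks, 10 seconds apart):
#   wait_for_safemode_off 60 10 sudo -u hdfs hdfs dfsadmin -safemode get
```

Passing the status command as arguments keeps the polling logic independent of how the command is run, for example with or without Kerberos credentials. The dfsadmin -safemode wait option is a built-in alternative if a bounded number of attempts is not needed.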

Start YARN or MapReduce MRv1

You are now ready to start and test MRv1 or YARN.

For YARN	or For MRv1
Start YARN and the MapReduce JobHistory Server	Start MRv1
Verify basic cluster operation	Verify basic cluster operation

Start MapReduce with YARN

Important:

Make sure you are not trying to run MRv1 and YARN on the same set of nodes at the same time. This is not supported; it will degrade your performance and may result in an unstable MapReduce cluster deployment. The Start MapReduce with YARN and Start MapReduce (MRv1) steps are mutually exclusive.

After you have verified HDFS is operating correctly, you are ready to start YARN. First, create directories and set the correct permissions.

For more information see Deploying MapReduce v2 (YARN) on a Cluster.

Create a history directory and set permissions; for example:

sudo -u hdfs hadoop fs -mkdir /user/history
sudo -u hdfs hadoop fs -chmod -R 1777 /user/history
sudo -u hdfs hadoop fs -chown yarn /user/history

Create the /var/log/hadoop-yarn directory and set ownership:

$ sudo -u hdfs hadoop fs -mkdir /var/log/hadoop-yarn
$ sudo -u hdfs hadoop fs -chown yarn:mapred /var/log/hadoop-yarn

You need to create this directory because it is the parent of /var/log/hadoop-yarn/apps which is explicitly configured in the yarn-site.xml.

Verify the directory structure, ownership, and permissions:

$ sudo -u hdfs hadoop fs -ls -R /

You should see:

drwxrwxrwt   - hdfs supergroup          0 2012-04-19 14:31 /tmp
drwxr-xr-x   - hdfs supergroup          0 2012-05-31 10:26 /user
drwxrwxrwt   - yarn supergroup          0 2012-04-19 14:31 /user/history
drwxr-xr-x   - hdfs   supergroup        0 2012-05-31 15:31 /var
drwxr-xr-x   - hdfs   supergroup        0 2012-05-31 15:31 /var/log
drwxr-xr-x   - yarn   mapred            0 2012-05-31 15:31 /var/log/hadoop-yarn
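On a file system with many entries, the recursive listing is long, and the /user/history line is easy to miss. The following sketch checks that one line automatically; the function name check_history_dir is hypothetical, and it assumes the column layout shown in the sample output above (permissions first, owner third, path last).

```shell
# Hypothetical helper: reads `hadoop fs -ls -R /` output on stdin and succeeds
# only if /user/history is listed with mode drwxrwxrwt and owner yarn.
check_history_dir() {
  awk '$NF == "/user/history" { found = 1; ok = ($1 == "drwxrwxrwt" && $3 == "yarn") }
       END { exit !(found && ok) }'
}

# Example:
#   sudo -u hdfs hadoop fs -ls -R / | check_history_dir && echo "history directory OK"
```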

To start YARN, start the ResourceManager and NodeManager services:

Note:

Make sure you always start ResourceManager before starting NodeManager services.

On the ResourceManager system:

$ sudo service hadoop-yarn-resourcemanager start

On each NodeManager system (typically the same ones where DataNode service runs):

$ sudo service hadoop-yarn-nodemanager start

To start the MapReduce JobHistory Server

On the MapReduce JobHistory Server system:

$ sudo service hadoop-mapreduce-historyserver start

For each user who will be submitting MapReduce jobs using MapReduce v2 (YARN), or running Pig, Hive, or Sqoop in a YARN installation, make sure that the HADOOP_MAPRED_HOME environment variable is set correctly as follows:

$ export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce

Verify basic cluster operation for YARN.

At this point your cluster is upgraded and ready to run jobs. Before running your production jobs, verify basic cluster operation by running an example from the Apache Hadoop web site.

Note:

For important configuration information, see Deploying MapReduce v2 (YARN) on a Cluster.

Create a home directory on HDFS for the user who will be running the job (for example, joe):
```
$ sudo -u hdfs hadoop fs -mkdir /user/joe
$ sudo -u hdfs hadoop fs -chown joe /user/joe 
```
Do the following steps as the user joe.

Make a directory in HDFS called input and copy some XML files into it by running the following commands in pseudo-distributed mode:

$ hadoop fs -mkdir input
$ hadoop fs -put /etc/hadoop/conf/*.xml input
$ hadoop fs -ls input
Found 3 items:
-rw-r--r--   1 joe supergroup       1348 2012-02-13 12:21 input/core-site.xml
-rw-r--r--   1 joe supergroup       1913 2012-02-13 12:21 input/hdfs-site.xml
-rw-r--r--   1 joe supergroup       1001 2012-02-13 12:21 input/mapred-site.xml

Set HADOOP_MAPRED_HOME for user joe:

$ export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce

Run an example Hadoop job to grep with a regular expression in your input data.

$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar grep input output23 'dfs[a-z.]+'

After the job completes, you can find the output in the HDFS directory named output23 because you specified that output directory to Hadoop.
```
$ hadoop fs -ls
Found 2 items
drwxr-xr-x   - joe supergroup  0 2009-08-18 18:36 /user/joe/input
drwxr-xr-x   - joe supergroup  0 2009-08-18 18:38 /user/joe/output23
```
You can see that there is a new directory called output23.

List the output files.

$ hadoop fs -ls output23
Found 2 items
-rw-r--r--  1  joe supergroup     0 2009-02-25 10:33   /user/joe/output23/_SUCCESS
-rw-r--r--  1  joe supergroup  1068 2009-02-25 10:33   /user/joe/output23/part-r-00000

Read the results in the output file.

$ hadoop fs -cat output23/part-r-00000 | head
1    dfs.safemode.min.datanodes
1    dfs.safemode.extension
1    dfs.replication
1    dfs.permissions.enabled
1    dfs.namenode.name.dir
1    dfs.namenode.checkpoint.dir
1    dfs.datanode.data.dir

You have now confirmed your cluster is successfully running CDH 5.

Important:

If you have client hosts, make sure you also update them to CDH 5, and upgrade the components running on those clients as well.

Start MapReduce (MRv1)

Important:

After you have verified HDFS is operating correctly, you are ready to start MapReduce. On each TaskTracker system:

$ sudo service hadoop-0.20-mapreduce-tasktracker start

On the JobTracker system:

$ sudo service hadoop-0.20-mapreduce-jobtracker start

Verify that the JobTracker and TaskTracker started properly.

$ sudo jps | grep Tracker

If the permissions of directories are not configured correctly, the JobTracker and TaskTracker processes start and immediately fail. If this happens, check the JobTracker and TaskTracker logs and set the permissions correctly.

Important:

For each user who will be submitting MapReduce jobs using MapReduce v1 (MRv1), or running Pig, Hive, or Sqoop in an MRv1 installation, set the HADOOP_MAPRED_HOME environment variable as follows:

$ export HADOOP_MAPRED_HOME=/usr/lib/hadoop-0.20-mapreduce

Verify basic cluster operation for MRv1.

At this point your cluster is upgraded and ready to run jobs. Before running your production jobs, verify basic cluster operation by running an example from the Apache Hadoop web site.

Create a home directory on HDFS for the user who will be running the job (for example, joe):
```
$ sudo -u hdfs hadoop fs -mkdir /user/joe
$ sudo -u hdfs hadoop fs -chown joe /user/joe
```
Do the following steps as the user joe.

Make a directory in HDFS called input and copy some XML files into it by running the following commands:

$ hadoop fs -mkdir input
$ hadoop fs -put /etc/hadoop/conf/*.xml input
$ hadoop fs -ls input
Found 3 items:
-rw-r--r--   1 joe supergroup       1348 2012-02-13 12:21 input/core-site.xml
-rw-r--r--   1 joe supergroup       1913 2012-02-13 12:21 input/hdfs-site.xml
-rw-r--r--   1 joe supergroup       1001 2012-02-13 12:21 input/mapred-site.xml

Set HADOOP_MAPRED_HOME for user joe:

$ export HADOOP_MAPRED_HOME=/usr/lib/hadoop-0.20-mapreduce

Run an example Hadoop job to grep with a regular expression in your input data.

$ /usr/bin/hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar grep input output 'dfs[a-z.]+'

After the job completes, you can find the output in the HDFS directory named output because you specified that output directory to Hadoop.
```
$ hadoop fs -ls
Found 2 items
drwxr-xr-x   - joe supergroup  0 2009-08-18 18:36 /user/joe/input
drwxr-xr-x   - joe supergroup  0 2009-08-18 18:38 /user/joe/output
```
You can see that there is a new directory called output.

List the output files.

$ hadoop fs -ls output
Found 3 items
drwxr-xr-x  -  joe supergroup     0 2009-02-25 10:33   /user/joe/output/_logs
-rw-r--r--  1  joe supergroup  1068 2009-02-25 10:33   /user/joe/output/part-00000
-rw-r--r--  1  joe supergroup     0 2009-02-25 10:33   /user/joe/output/_SUCCESS

Read the results in the output file; for example:
```
$ hadoop fs -cat output/part-00000 | head
1       dfs.datanode.data.dir
1       dfs.namenode.checkpoint.dir
1       dfs.namenode.name.dir
1       dfs.replication
1       dfs.safemode.extension
1       dfs.safemode.min.datanodes
```
You have now confirmed your cluster is successfully running CDH 5.

Important:
If you have client hosts, make sure you also update them to CDH 5, and upgrade the components running on those clients as well.

Set the Sticky Bit

For security reasons Cloudera strongly recommends you set the sticky bit on directories if you have not already done so.

The sticky bit prevents anyone except the superuser, directory owner, or file owner from deleting or moving the files within a directory. (Setting the sticky bit for a file has no effect.) Do this for directories such as /tmp. (For instructions on creating /tmp and setting its permissions, see these instructions).
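In a directory listing, the sticky bit appears as a t (or T) in the last position of the permission string, as in drwxrwxrwt. The sketch below tests a permission string for it; the function name has_sticky_bit is hypothetical, and the chmod line in the comment mirrors the 1777 mode used for /user/history earlier on this page.

```shell
# Hypothetical helper: succeed if a 10-character directory permission string,
# as printed by `hadoop fs -ls`, ends in t or T (sticky bit set).
has_sticky_bit() {
  case $1 in
    d????????[tT]) return 0 ;;
    *)             return 1 ;;
  esac
}

# Example: set the sticky bit on /tmp in HDFS, then check the listed mode.
#   sudo -u hdfs hadoop fs -chmod 1777 /tmp
#   has_sticky_bit drwxrwxrwt && echo "sticky bit is set"
```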

Re-Install CDH 5 Components

CDH 5 Components

Use the following sections to install or upgrade CDH 5 components. See also the instructions for installing or updating LZO.

Apply Configuration File Changes

Important:

During uninstall, the package manager renames any configuration files you have modified from <file> to <file>.rpmsave. During re-install, the package manager creates a new <file> with applicable defaults. You are responsible for applying any changes captured in the original CDH 4 configuration file to the new CDH 5 configuration file. In the case of Ubuntu and Debian upgrades, a file will not be installed if there is already a version of that file on the system, and you will be prompted to resolve conflicts; for details, see Automatic handling of configuration files by dpkg.

For example, if you have modified your CDH 4 zoo.cfg configuration file (/etc/zookeeper.dist/zoo.cfg), RPM uninstall and re-install (using yum remove) renames and preserves a copy of your modified zoo.cfg as /etc/zookeeper.dist/zoo.cfg.rpmsave. You should compare this to the new /etc/zookeeper/conf/zoo.cfg and resolve any differences that should be carried forward (typically where you have changed property value defaults). Do this for each component you upgrade to CDH 5.
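To find every configuration file that the package manager preserved, you can search for the .rpmsave suffix. This is a minimal sketch for RPM-based systems; the function name find_rpmsave is hypothetical, and the diff in the comment simply reuses the ZooKeeper paths from the example above.

```shell
# Hypothetical helper: list the configuration files preserved by RPM
# (names ending in .rpmsave) under a directory, in sorted order.
# $1: directory to search, for example /etc
find_rpmsave() {
  find "$1" -type f -name '*.rpmsave' | sort
}

# Example:
#   find_rpmsave /etc
# Then compare a saved file with its CDH 5 replacement:
#   diff /etc/zookeeper.dist/zoo.cfg.rpmsave /etc/zookeeper/conf/zoo.cfg
```

Because the saved file and the new file can live in different directories, as in the ZooKeeper example, the sketch only lists candidates and leaves the comparison to you.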

Finalize the HDFS Metadata Upgrade

To finalize the HDFS metadata upgrade you began earlier in this procedure, proceed as follows:

Make sure you are satisfied that the CDH 5 upgrade has succeeded and everything is running smoothly. This could take a matter of days, or even weeks.
Warning:
Do not proceed until you are sure you are satisfied with the new deployment. Once you have finalized the HDFS metadata, you cannot revert to an earlier version of HDFS.
Note:
If you need to restart the NameNode during this period (after having begun the upgrade process, but before you've run finalizeUpgrade) simply restart your NameNode without the -upgrade option.
Finalize the HDFS metadata upgrade: use one of the following commands, depending on whether Kerberos is enabled (see Configuring Hadoop Security in CDH 5).
Important:
In an HDFS HA deployment, make sure that both the NameNodes and all of the JournalNodes are up and functioning normally before you proceed.
- If Kerberos is enabled:
```
$ kinit -kt /path/to/hdfs.keytab hdfs/<fully.qualified.domain.name@YOUR-REALM.COM> && hdfs dfsadmin -finalizeUpgrade
```
- If Kerberos is not enabled:
```
$ sudo -u hdfs hdfs dfsadmin -finalizeUpgrade
```
Note:
After the metadata upgrade completes, the previous/ and blocksBeingWritten/ directories in the DataNodes' data directories aren't cleared until the DataNodes are restarted.

Page generated September 3, 2015.