Deploying MapReduce v1 (MRv1) on a Cluster
This section describes configuration and startup tasks for MRv1 clusters only.
- Configure properties for MRv1 clusters
- Configure local storage directories for use by MRv1 daemons
- Configure a health check script for DataNode processes
- Configure JobTracker Recovery
- If necessary, deploy the configuration
- If necessary, start HDFS
- Create the HDFS /tmp directory
- Create MapReduce /var directories
- Verify the HDFS File Structure
- Create and configure the mapred.system.dir directory in HDFS
- Start MapReduce
- Create a Home Directory for each MapReduce User
- Set the HADOOP_MAPRED_HOME environment variable
- Configure the Hadoop daemons to start at boot time
Step 1: Configure Properties for MRv1 Clusters
Property | Configuration File | Description |
---|---|---|
mapred.job.tracker | conf/mapred-site.xml | If you plan to run your cluster with MRv1 daemons you need to specify the hostname and (optionally) port of the JobTracker's RPC server, in the form <host>:<port>. See Ports Used by Components of CDH 5 for the default port. If the value is set to local, the default, the JobTracker runs on demand when you run a MapReduce job; do not try to start the JobTracker yourself in this case. Note: if you specify the host (rather than using local) this must be the hostname (for example mynamenode) not the IP address. |
Sample configuration:
mapred-site.xml:
<property>
  <name>mapred.job.tracker</name>
  <value>jobtracker-host.company.com:8021</value>
</property>
Step 2: Configure Local Storage Directories for Use by MRv1 Daemons
For MRv1, you need to configure an additional property in the mapred-site.xml file.
Property | Configuration File Location | Description |
---|---|---|
mapred.local.dir | mapred-site.xml on each TaskTracker | This property specifies the directories where the TaskTracker will store temporary data and intermediate map output files while running MapReduce jobs. Cloudera recommends that this property specifies a directory on each of the JBOD mount points; for example, /data/1/mapred/local through /data/N/mapred/local. |
Sample configuration:
mapred-site.xml on each TaskTracker:
<property>
  <name>mapred.local.dir</name>
  <value>/data/1/mapred/local,/data/2/mapred/local,/data/3/mapred/local</value>
</property>
After specifying these directories in the mapred-site.xml file, you must create the directories and assign the correct file permissions to them on each node in your cluster.
To configure local storage directories for use by MapReduce:
In the following instructions, local path examples are used to represent Hadoop parameters. The mapred.local.dir parameter is represented by the /data/1/mapred/local, /data/2/mapred/local, /data/3/mapred/local, and /data/4/mapred/local path examples. Change the path examples to match your configuration.
- Create the mapred.local.dir local directories:
$ sudo mkdir -p /data/1/mapred/local /data/2/mapred/local /data/3/mapred/local /data/4/mapred/local
- Configure the owner of the mapred.local.dir directory to be the mapred user:
$ sudo chown -R mapred:hadoop /data/1/mapred/local /data/2/mapred/local /data/3/mapred/local /data/4/mapred/local
The correct owner and permissions of these local directories are:
Owner | Permissions |
---|---|
mapred:hadoop | drwxr-xr-x |
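You can confirm the ownership and permissions on each TaskTracker node with ls -ld; adjust the paths to match your configuration:
$ ls -ld /data/1/mapred/local /data/2/mapred/local /data/3/mapred/local /data/4/mapred/local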
Step 3: Configure a Health Check Script for DataNode Processes
In CDH releases before CDH 4, the failure of a single mapred.local.dir caused the MapReduce TaskTracker process to shut down, resulting in the machine not being available to execute tasks. In CDH 5, as in CDH 4, the TaskTracker process will continue to execute tasks as long as it has a single functioning mapred.local.dir available. No configuration change is necessary to enable this behavior.
Because a TaskTracker that has few functioning local directories will not perform well, Cloudera recommends configuring a health script that checks if the DataNode process is running (if configured as described under Configuring DataNodes to Tolerate Local Storage Directory Failure, the DataNode will shut down after the configured number of directory failures). Here is an example health script that exits if the DataNode process is not running:
#!/bin/bash
if ! jps | grep -q DataNode ; then
  echo ERROR: datanode not up
fi
In practice, the dfs.data.dir and mapred.local.dir are often configured on the same set of disks, so a disk failure will result in the failure of both a dfs.data.dir and mapred.local.dir.
See the section titled "Configuring the Node Health Check Script" in the Apache cluster setup documentation for further details.
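As a minimal sketch, the script can then be registered with the TaskTracker through the MRv1 node health-checker properties in mapred-site.xml; the script path and interval below are example values, not requirements:
<property>
  <name>mapred.healthChecker.script.path</name>
  <value>/usr/lib/hadoop/bin/check_datanode_health.sh</value>
</property>
<property>
  <name>mapred.healthChecker.interval</name>
  <value>60000</value>
</property>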
Step 4: Configure JobTracker Recovery
JobTracker recovery means that jobs that are running when JobTracker fails (for example, because of a system crash or hardware failure) are re-run when the JobTracker is restarted. Any jobs that were running at the time of the failure will be re-run from the beginning automatically.
A recovered job will have the following properties:
- It will have the same job ID as when it was submitted.
- It will run under the same user as the original job.
- It will write to the same output directory as the original job, overwriting any previous output.
- It will show as RUNNING on the JobTracker web page after you restart the JobTracker.
Enabling JobTracker Recovery
By default JobTracker recovery is off, but you can enable it by setting the property mapreduce.jobtracker.restart.recover to true in mapred-site.xml.
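For example, add the following to mapred-site.xml on the JobTracker and restart the JobTracker for the change to take effect:
<property>
  <name>mapreduce.jobtracker.restart.recover</name>
  <value>true</value>
</property>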
Step 5: If Necessary, Deploy your Custom Configuration to your Entire Cluster
Deploy the configuration if you have not already done so.
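For example, if your custom configuration lives in a directory such as /etc/hadoop/conf.my_cluster (the directory name is only an example), one common approach is to activate it with the alternatives mechanism on each node:
$ sudo update-alternatives --install /etc/hadoop/conf hadoop-conf /etc/hadoop/conf.my_cluster 50
$ sudo update-alternatives --set hadoop-conf /etc/hadoop/conf.my_cluster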
Step 6: If Necessary, Start HDFS on Every Node in the Cluster
Start HDFS if you have not already done so.
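If HDFS is not yet running, you can typically start it with the packaged init scripts on the appropriate nodes; for example:
$ sudo service hadoop-hdfs-namenode start      # on the NameNode
$ sudo service hadoop-hdfs-datanode start      # on each DataNode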
Step 7: If Necessary, Create the HDFS /tmp Directory
Create the /tmp directory if you have not already done so.
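If it does not already exist, /tmp is created world-writable with the sticky bit set (matching the drwxrwxrwt shown in the listing in Step 9); for example:
$ sudo -u hdfs hadoop fs -mkdir /tmp
$ sudo -u hdfs hadoop fs -chmod -R 1777 /tmp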
Step 8: Create MapReduce /var directories
sudo -u hdfs hadoop fs -mkdir -p /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
sudo -u hdfs hadoop fs -chmod 1777 /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
sudo -u hdfs hadoop fs -chown -R mapred /var/lib/hadoop-hdfs/cache/mapred
Step 9: Verify the HDFS File Structure
$ sudo -u hdfs hadoop fs -ls -R /
You should see:
drwxrwxrwt   - hdfs   supergroup   0 2012-04-19 15:14 /tmp
drwxr-xr-x   - hdfs   supergroup   0 2012-04-19 15:16 /var
drwxr-xr-x   - hdfs   supergroup   0 2012-04-19 15:16 /var/lib
drwxr-xr-x   - hdfs   supergroup   0 2012-04-19 15:16 /var/lib/hadoop-hdfs
drwxr-xr-x   - hdfs   supergroup   0 2012-04-19 15:16 /var/lib/hadoop-hdfs/cache
drwxr-xr-x   - mapred supergroup   0 2012-04-19 15:19 /var/lib/hadoop-hdfs/cache/mapred
drwxr-xr-x   - mapred supergroup   0 2012-04-19 15:29 /var/lib/hadoop-hdfs/cache/mapred/mapred
drwxrwxrwt   - mapred supergroup   0 2012-04-19 15:33 /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
Step 10: Create and Configure the mapred.system.dir Directory in HDFS
After you start HDFS and create /tmp, but before you start the JobTracker (see the next step), you must also create the HDFS directory specified by the mapred.system.dir parameter (by default ${hadoop.tmp.dir}/mapred/system) and configure it to be owned by the mapred user.
To create the directory in its default location:
$ sudo -u hdfs hadoop fs -mkdir /tmp/mapred/system
$ sudo -u hdfs hadoop fs -chown mapred:hadoop /tmp/mapred/system
When starting up, MapReduce sets the permissions for the mapred.system.dir directory to drwx------, assuming the user mapred owns that directory.
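If you prefer a non-default location, you can set mapred.system.dir explicitly in mapred-site.xml and create and chown that directory instead; the path below is only an example:
<property>
  <name>mapred.system.dir</name>
  <value>/mapred/system</value>
</property>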
Step 11: Start MapReduce
To start MapReduce, start the TaskTracker and JobTracker services.
On each TaskTracker system:
$ sudo service hadoop-0.20-mapreduce-tasktracker start
On the JobTracker system:
$ sudo service hadoop-0.20-mapreduce-jobtracker start
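To confirm that the daemons started, you can check the init script status on each node and review the daemon logs (under /var/log/hadoop-0.20-mapreduce by default); for example:
$ sudo service hadoop-0.20-mapreduce-tasktracker status
$ sudo service hadoop-0.20-mapreduce-jobtracker status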
Step 12: Create a Home Directory for each MapReduce User
Create a home directory for each MapReduce user. It is best to do this on the NameNode; for example:
$ sudo -u hdfs hadoop fs -mkdir /user/<user>
$ sudo -u hdfs hadoop fs -chown <user> /user/<user>
where <user> is the Linux username of each user.
Alternatively, you can log in as each Linux user (or write a script to do so) and create the home directory as follows:
sudo -u hdfs hadoop fs -mkdir /user/$USER
sudo -u hdfs hadoop fs -chown $USER /user/$USER
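If you have many users, a short loop covers the same two commands per account; the usernames below are placeholders:
for user in alice bob carol; do
  sudo -u hdfs hadoop fs -mkdir /user/$user
  sudo -u hdfs hadoop fs -chown $user /user/$user
done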
Step 13: Set HADOOP_MAPRED_HOME
$ export HADOOP_MAPRED_HOME=/usr/lib/hadoop-0.20-mapreduce
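Set this variable for each user who will submit MapReduce jobs. To make it persistent, you can for example append the export line to each user's shell profile or to a system-wide file such as /etc/profile.d/hadoop-mapreduce.sh (the file name here is an assumption, not a requirement):
$ echo 'export HADOOP_MAPRED_HOME=/usr/lib/hadoop-0.20-mapreduce' | sudo tee /etc/profile.d/hadoop-mapreduce.sh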