Deploying MapReduce v1 (MRv1) on a Cluster
This section describes configuration and startup tasks for MRv1 clusters only.
- Configure properties for MRv1 clusters
- Configure local storage directories for use by MRv1 daemons
- Configure a health check script for DataNode processes
- Configure JobTracker Recovery
- If necessary, deploy the configuration
- If necessary, start HDFS
- Create the HDFS /tmp directory
- Create MapReduce /var directories
- Verify the HDFS File Structure
- Create and configure the mapred.system.dir directory in HDFS
- Start MapReduce
- Create a Home Directory for each MapReduce User
- Set the HADOOP_MAPRED_HOME environment variable
- Configure the Hadoop daemons to start at boot time
Step 1: Configure Properties for MRv1 Clusters
Property | Configuration File | Description |
---|---|---|
mapred.job.tracker | conf/mapred-site.xml | If you plan to run your cluster with MRv1 daemons you need to specify the hostname and (optionally) port of the JobTracker's RPC server, in the form <host>:<port>. See Ports Used by Components of CDH 5 for the default port. If the value is set to local, the default, the JobTracker runs on demand when you run a MapReduce job; do not try to start the JobTracker yourself in this case. Note: if you specify the host (rather than using local) this must be the hostname (for example mynamenode) not the IP address. |
Sample configuration:
mapred-site.xml:
<property>
  <name>mapred.job.tracker</name>
  <value>jobtracker-host.company.com:8021</value>
</property>
Step 2: Configure Local Storage Directories for Use by MRv1 Daemons
For MRv1, you need to configure an additional property in the mapred-site.xml file.
Property | Configuration File Location | Description |
---|---|---|
mapred.local.dir | mapred-site.xml on each TaskTracker | This property specifies the directories where the TaskTracker will store temporary data and intermediate map output files while running MapReduce jobs. Cloudera recommends that this property specifies a directory on each of the JBOD mount points; for example, /data/1/mapred/local through /data/N/mapred/local. |
Sample configuration:
mapred-site.xml on each TaskTracker:
<property>
  <name>mapred.local.dir</name>
  <value>/data/1/mapred/local,/data/2/mapred/local,/data/3/mapred/local</value>
</property>
After specifying these directories in the mapred-site.xml file, you must create the directories and assign the correct file permissions to them on each node in your cluster.
To configure local storage directories for use by MapReduce:
In the following instructions, local path examples are used to represent Hadoop parameters. The mapred.local.dir parameter is represented by the /data/1/mapred/local, /data/2/mapred/local, /data/3/mapred/local, and /data/4/mapred/local path examples. Change the path examples to match your configuration.
- Create the mapred.local.dir local directories:
$ sudo mkdir -p /data/1/mapred/local /data/2/mapred/local /data/3/mapred/local /data/4/mapred/local
- Configure the owner of the mapred.local.dir directory to be the mapred user:
$ sudo chown -R mapred:hadoop /data/1/mapred/local /data/2/mapred/local /data/3/mapred/local /data/4/mapred/local
The correct owner and permissions of these local directories are:
Owner | Permissions |
---|---|
mapred:hadoop | drwxr-xr-x |
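You can confirm the ownership and permissions on each TaskTracker node with ls -ld; adjust the paths to match your configuration:
$ ls -ld /data/1/mapred/local /data/2/mapred/local /data/3/mapred/local /data/4/mapred/local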
Step 3: Configure a Health Check Script for DataNode Processes
In CDH releases before CDH 4, the failure of a single mapred.local.dir caused the MapReduce TaskTracker process to shut down, resulting in the machine not being available to execute tasks. In CDH 5, as in CDH 4, the TaskTracker process will continue to execute tasks as long as it has a single functioning mapred.local.dir available. No configuration change is necessary to enable this behavior.
Because a TaskTracker that has few functioning local directories will not perform well, Cloudera recommends configuring a health script that checks if the DataNode process is running (if configured as described under Configuring DataNodes to Tolerate Local Storage Directory Failure, the DataNode will shut down after the configured number of directory failures). Here is an example health script that exits if the DataNode process is not running:
#!/bin/bash
if ! jps | grep -q DataNode ; then
  echo ERROR: datanode not up
fi
In practice, the dfs.data.dir and mapred.local.dir are often configured on the same set of disks, so a disk failure will result in the failure of both a dfs.data.dir and mapred.local.dir.
See the section titled "Configuring the Node Health Check Script" in the Apache cluster setup documentation for further details.
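As a minimal sketch, the script can then be registered with the TaskTracker through the MRv1 node health-checker properties in mapred-site.xml; the script path and interval below are example values, not requirements:
<property>
  <name>mapred.healthChecker.script.path</name>
  <value>/usr/lib/hadoop/bin/check_datanode_health.sh</value>
</property>
<property>
  <name>mapred.healthChecker.interval</name>
  <value>60000</value>
</property>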
Step 4: Configure JobTracker Recovery
JobTracker recovery means that jobs that are running when JobTracker fails (for example, because of a system crash or hardware failure) are re-run when the JobTracker is restarted. Any jobs that were running at the time of the failure will be re-run from the beginning automatically.
A recovered job will have the following properties:
- It will have the same job ID as when it was submitted.
- It will run under the same user as the original job.
- It will write to the same output directory as the original job, overwriting any previous output.
- It will show as RUNNING on the JobTracker web page after you restart the JobTracker.
Enabling JobTracker Recovery
By default JobTracker recovery is off, but you can enable it by setting the property mapreduce.jobtracker.restart.recover to true in mapred-site.xml.
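For example, add the following to mapred-site.xml on the JobTracker and restart the JobTracker for the change to take effect:
<property>
  <name>mapreduce.jobtracker.restart.recover</name>
  <value>true</value>
</property>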
Step 5: If Necessary, Deploy your Custom Configuration to your Entire Cluster
Deploy the configuration if you have not already done so.
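For example, if your custom configuration lives in a directory such as /etc/hadoop/conf.my_cluster (the directory name is only an example), one common approach is to activate it with the alternatives mechanism on each node:
$ sudo update-alternatives --install /etc/hadoop/conf hadoop-conf /etc/hadoop/conf.my_cluster 50
$ sudo update-alternatives --set hadoop-conf /etc/hadoop/conf.my_cluster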
Step 6: If Necessary, Start HDFS on Every Node in the Cluster
Start HDFS if you have not already done so.
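If HDFS is not yet running, you can typically start it with the packaged init scripts on the appropriate nodes; for example:
$ sudo service hadoop-hdfs-namenode start      # on the NameNode
$ sudo service hadoop-hdfs-datanode start      # on each DataNode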
Step 7: If Necessary, Create the HDFS /tmp Directory
Create the /tmp directory if you have not already done so.
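If it does not already exist, /tmp is created world-writable with the sticky bit set (matching the drwxrwxrwt shown in the listing in Step 9); for example:
$ sudo -u hdfs hadoop fs -mkdir /tmp
$ sudo -u hdfs hadoop fs -chmod -R 1777 /tmp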
Step 8: Create MapReduce /var directories
sudo -u hdfs hadoop fs -mkdir -p /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
sudo -u hdfs hadoop fs -chmod 1777 /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
sudo -u hdfs hadoop fs -chown -R mapred /var/lib/hadoop-hdfs/cache/mapred
Step 9: Verify the HDFS File Structure
$ sudo -u hdfs hadoop fs -ls -R /
You should see:
drwxrwxrwt   - hdfs   supergroup   0 2012-04-19 15:14 /tmp
drwxr-xr-x   - hdfs   supergroup   0 2012-04-19 15:16 /var
drwxr-xr-x   - hdfs   supergroup   0 2012-04-19 15:16 /var/lib
drwxr-xr-x   - hdfs   supergroup   0 2012-04-19 15:16 /var/lib/hadoop-hdfs
drwxr-xr-x   - hdfs   supergroup   0 2012-04-19 15:16 /var/lib/hadoop-hdfs/cache
drwxr-xr-x   - mapred supergroup   0 2012-04-19 15:19 /var/lib/hadoop-hdfs/cache/mapred
drwxr-xr-x   - mapred supergroup   0 2012-04-19 15:29 /var/lib/hadoop-hdfs/cache/mapred/mapred
drwxrwxrwt   - mapred supergroup   0 2012-04-19 15:33 /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
Step 10: Create and Configure the mapred.system.dir Directory in HDFS
After you start HDFS and create /tmp, but before you start the JobTracker (see the next step), you must also create the HDFS directory specified by the mapred.system.dir parameter (by default ${hadoop.tmp.dir}/mapred/system) and configure it to be owned by the mapred user.
To create the directory in its default location:
$ sudo -u hdfs hadoop fs -mkdir /tmp/mapred/system
$ sudo -u hdfs hadoop fs -chown mapred:hadoop /tmp/mapred/system
When starting up, MapReduce sets the permissions for the mapred.system.dir directory to drwx------, assuming the user mapred owns that directory.
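If you prefer a non-default location, you can set mapred.system.dir explicitly in mapred-site.xml and create and chown that directory instead; the path below is only an example:
<property>
  <name>mapred.system.dir</name>
  <value>/mapred/system</value>
</property>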
Step 11: Start MapReduce
To start MapReduce, start the TaskTracker and JobTracker services.
On each TaskTracker system:
$ sudo service hadoop-0.20-mapreduce-tasktracker start
On the JobTracker system:
$ sudo service hadoop-0.20-mapreduce-jobtracker start
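To confirm that the daemons started, you can check the init script status on each node and review the daemon logs (under /var/log/hadoop-0.20-mapreduce by default); for example:
$ sudo service hadoop-0.20-mapreduce-tasktracker status
$ sudo service hadoop-0.20-mapreduce-jobtracker status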
Step 12: Create a Home Directory for each MapReduce User
Create a home directory for each MapReduce user. It is best to do this on the NameNode; for example:
$ sudo -u hdfs hadoop fs -mkdir /user/<user>
$ sudo -u hdfs hadoop fs -chown <user> /user/<user>
where <user> is the Linux username of each user.
Alternatively, you can log in as each Linux user (or write a script to do so) and create the home directory as follows:
sudo -u hdfs hadoop fs -mkdir /user/$USER
sudo -u hdfs hadoop fs -chown $USER /user/$USER
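If you have many users, a short loop covers the same two commands per account; the usernames below are placeholders:
for user in alice bob carol; do
  sudo -u hdfs hadoop fs -mkdir /user/$user
  sudo -u hdfs hadoop fs -chown $user /user/$user
done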
Step 13: Set HADOOP_MAPRED_HOME
$ export HADOOP_MAPRED_HOME=/usr/lib/hadoop-0.20-mapreduce
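Set this variable for each user who will submit MapReduce jobs. To make it persistent, you can for example append the export line to each user's shell profile or to a system-wide file such as /etc/profile.d/hadoop-mapreduce.sh (the file name here is an assumption, not a requirement):
$ echo 'export HADOOP_MAPRED_HOME=/usr/lib/hadoop-0.20-mapreduce' | sudo tee /etc/profile.d/hadoop-mapreduce.sh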