Deploying MapReduce v1 (MRv1) on a Cluster

This topic describes configuration and startup tasks for MRv1 clusters only.

  1. Make sure you have configured and deployed HDFS.
  2. Configure the JobTracker's RPC server.
    1. Open the mapred-site.xml file in the custom directory you created when you copied the Hadoop configuration.
    2. Specify the hostname and (optionally) port of the JobTracker's RPC server, in the form <host>:<port>. The default value is local. With the default value, the JobTracker runs on demand when you run a MapReduce job; do not try to start the JobTracker yourself in this case. If you specify a host other than local, use the hostname (for example, jobtracker-host.company.com), not the IP address.

      For example:

      <property>
       <name>mapred.job.tracker</name>
       <value>jobtracker-host.company.com:8021</value>
      </property>
  3. Configure local storage directories for use by MRv1 daemons.
    1. Open the mapred-site.xml file in the custom directory you created when you copied the Hadoop configuration.
    2. Edit the mapred.local.dir property to specify the directories where the TaskTracker will store temporary data and intermediate map output files while running MapReduce jobs. Cloudera recommends that you specify a directory on each of the JBOD mount points: /data/1/mapred/local through /data/N/mapred/local. For example:
      <property>
       <name>mapred.local.dir</name>
       <value>/data/1/mapred/local,/data/2/mapred/local,/data/3/mapred/local,/data/4/mapred/local</value>
      </property>
    3. Create the mapred.local.dir local directories:
      $ sudo mkdir -p /data/1/mapred/local /data/2/mapred/local /data/3/mapred/local /data/4/mapred/local
    4. Configure the owner of the mapred.local.dir directory to be the mapred user:
      $ sudo chown -R mapred:hadoop /data/1/mapred/local /data/2/mapred/local /data/3/mapred/local /data/4/mapred/local
    5. Set the permissions to drwxr-xr-x.
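
      For example, to set drwxr-xr-x (mode 755) permissions on the directories created above:

      $ sudo chmod 755 /data/1/mapred/local /data/2/mapred/local /data/3/mapred/local /data/4/mapred/local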
    6. Repeat for each TaskTracker.
  4. Configure a health check script for DataNode processes.

    Because a TaskTracker with few functioning local directories will not perform well, Cloudera recommends configuring a health script that checks whether the DataNode process is running. (If configured as described under Configuring DataNodes to Tolerate Local Storage Directory Failure, the DataNode shuts down after the configured number of directory failures.) The following example health script prints an ERROR line when the DataNode process is not running; the node health checker treats any output line that begins with the string ERROR as an indication that the node is unhealthy:

    #!/bin/bash
    if ! jps | grep -q DataNode ; then
     echo ERROR: datanode not up
    fi

    In practice, dfs.data.dir and mapred.local.dir are often configured on the same set of disks, so a disk failure typically results in the failure of both a dfs.data.dir and a mapred.local.dir.

    For more information, see the section titled "Configuring the Node Health Check Script" in the Apache cluster setup documentation.
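
    That documentation describes TaskTracker health-check properties such as mapred.healthChecker.script.path and mapred.healthChecker.interval. A minimal sketch of registering the example script in mapred-site.xml, assuming it has been saved to an illustrative path such as /usr/lib/hadoop/bin/check_datanode_health.sh, might look like this:

    <property>
     <name>mapred.healthChecker.script.path</name>
     <value>/usr/lib/hadoop/bin/check_datanode_health.sh</value>
    </property>
    <!-- run the check every 60 seconds (60000 ms) -->
    <property>
     <name>mapred.healthChecker.interval</name>
     <value>60000</value>
    </property>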

  5. Configure JobTracker recovery.

    Set the property mapreduce.jobtracker.restart.recover to true in mapred-site.xml.
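
    For example:

    <property>
     <name>mapreduce.jobtracker.restart.recover</name>
     <value>true</value>
    </property>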

    With recovery enabled, jobs that were running when the JobTracker failed because of a system crash or hardware failure are re-run after the JobTracker restarts. A recovered job has the following properties:

    • It will have the same job ID as when it was submitted.
    • It will run under the same user as the original job.
    • It will write to the same output directory as the original job, overwriting any previous output.
    • It will show as RUNNING on the JobTracker web page after you restart the JobTracker.
  6. Create MapReduce /var directories:
    $ sudo -u hdfs hadoop fs -mkdir -p /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
    $ sudo -u hdfs hadoop fs -chmod 1777 /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
    $ sudo -u hdfs hadoop fs -chown -R mapred /var/lib/hadoop-hdfs/cache/mapred
  7. Verify the HDFS file structure:
    $ sudo -u hdfs hadoop fs -ls -R /

    You should see:

    drwxrwxrwt   - hdfs     supergroup          0 2012-04-19 15:14 /tmp
    drwxr-xr-x   - hdfs     supergroup          0 2012-04-19 15:16 /var
    drwxr-xr-x   - hdfs     supergroup          0 2012-04-19 15:16 /var/lib
    drwxr-xr-x   - hdfs     supergroup          0 2012-04-19 15:16 /var/lib/hadoop-hdfs
    drwxr-xr-x   - hdfs     supergroup          0 2012-04-19 15:16 /var/lib/hadoop-hdfs/cache
    drwxr-xr-x   - mapred   supergroup          0 2012-04-19 15:19 /var/lib/hadoop-hdfs/cache/mapred
    drwxr-xr-x   - mapred   supergroup          0 2012-04-19 15:29 /var/lib/hadoop-hdfs/cache/mapred/mapred
    drwxrwxrwt   - mapred   supergroup          0 2012-04-19 15:33 /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
  8. Create and configure the mapred.system.dir directory in HDFS. Create the HDFS directory specified by the mapred.system.dir parameter (by default, ${hadoop.tmp.dir}/mapred/system), and configure it to be owned by the mapred user.

    To create the directory in its default location:

    $ sudo -u hdfs hadoop fs -mkdir /tmp/mapred/system
    $ sudo -u hdfs hadoop fs -chown mapred:hadoop /tmp/mapred/system

    When starting up, MapReduce sets the permissions for the mapred.system.dir directory to drwx------, assuming the user mapred owns that directory.
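
    If you create the mapred.system.dir directory somewhere other than the default location, also specify that path in mapred-site.xml so the JobTracker can find it. A minimal sketch, using /mapred/system purely as an illustrative path:

    <property>
     <name>mapred.system.dir</name>
     <value>/mapred/system</value>
    </property>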

  9. Start MapReduce by starting the TaskTracker and JobTracker services.
    • On each TaskTracker system:
      $ sudo service hadoop-0.20-mapreduce-tasktracker start
    • On the JobTracker system:
      $ sudo service hadoop-0.20-mapreduce-jobtracker start
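
    To confirm that the daemons are running, you can query the init scripts (assuming they support a status action, as most do), for example:

    $ sudo service hadoop-0.20-mapreduce-jobtracker status
    $ sudo service hadoop-0.20-mapreduce-tasktracker status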
  10. Create a home directory for each MapReduce user. On the NameNode, enter:
    $ sudo -u hdfs hadoop fs -mkdir /user/<user>
    $ sudo -u hdfs hadoop fs -chown <user> /user/<user>

    where <user> is the Linux username of each user.

    Alternatively, you can log in as each Linux user (or write a script to do so) and create the home directory as follows:

    $ sudo -u hdfs hadoop fs -mkdir /user/$USER
    $ sudo -u hdfs hadoop fs -chown $USER /user/$USER
  11. Set HADOOP_MAPRED_HOME.
    $ export HADOOP_MAPRED_HOME=/usr/lib/hadoop-0.20-mapreduce

    Set this environment variable for each user who will be submitting MapReduce jobs using MapReduce v1 (MRv1), or running Pig, Hive, or Sqoop in an MRv1 installation.
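
    To make the setting persistent across logins, one option is to append the export line to each such user's shell startup file (assuming a bash login shell), for example:

    $ echo 'export HADOOP_MAPRED_HOME=/usr/lib/hadoop-0.20-mapreduce' >> ~/.bash_profile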

  12. Configure the Hadoop daemons to start at boot time. For more information, see Configuring the Daemons to Start on Boot.
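
    On Red Hat-compatible systems, for example, this is typically done with chkconfig (an illustrative sketch; see the referenced documentation for the procedure that applies to your platform):

    $ sudo chkconfig hadoop-0.20-mapreduce-jobtracker on
    $ sudo chkconfig hadoop-0.20-mapreduce-tasktracker on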