Hadoop MapReduce is a software framework for processing large datasets. It runs in parallel across many computers in a cluster, and the cluster hosts function as if they were one large computer. For a brief overview of parallel processing, see the YouTube video tl;dr: Quick Introduction to Hadoop MapReduce.
A MapReduce job usually splits the input dataset into independent chunks. The job processes the chunks as map tasks, in parallel. During the map phase, the master computer instructs worker computers to process their local input data. Hadoop then sorts the map output, and each worker computer passes its results to the appropriate reducer computer. The master computer collects the results from all reducers and compiles the answer to the overall query. Both the input and the output of the job are usually stored in a distributed filesystem.
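The split, map, sort/shuffle, and reduce phases described above can be sketched in plain Java as a word count. This is an illustrative simulation of the dataflow only; the class and method names are invented for this sketch and are not Hadoop APIs.

```java
import java.util.*;
import java.util.stream.*;

// A plain-Java sketch of the MapReduce dataflow (word count).
// No Hadoop dependencies; all names here are illustrative.
public class MapReduceSketch {

    // Map phase: each "worker" turns one input chunk into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String chunk) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : chunk.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) pairs.add(Map.entry(word, 1));
        }
        return pairs;
    }

    // Sort/shuffle phase: group all values for the same key, in sorted key order.
    static SortedMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the counts for each word.
    static Map<String, Integer> reduce(SortedMap<String, List<Integer>> grouped) {
        Map<String, Integer> result = new LinkedHashMap<>();
        grouped.forEach((word, counts) ->
                result.put(word, counts.stream().mapToInt(Integer::intValue).sum()));
        return result;
    }

    public static void main(String[] args) {
        // The input split: two independent chunks, each handled by its own map task
        // (on a real cluster these would run in parallel on worker hosts).
        List<String> chunks = List.of("the quick brown fox", "the lazy dog the end");

        List<Map.Entry<String, Integer>> mapped = chunks.stream()
                .flatMap(c -> map(c).stream())
                .collect(Collectors.toList());

        Map<String, Integer> counts = reduce(shuffle(mapped));
        System.out.println(counts.get("the")); // prints 3
    }
}
```

In a real job the map tasks run on separate hosts and the shuffle moves data over the network to the reducers; the sequential collections here only model the logical dataflow.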
Typically, the compute hosts and the storage hosts are the same; that is, the MapReduce framework and the Hadoop Distributed File System (HDFS) run on the same set of hosts. This configuration allows the framework to effectively schedule tasks on the hosts where data is already present, resulting in very high aggregate bandwidth across the cluster.
There are two versions of MapReduce: MRv1 and MRv2 (or YARN). The older runtime framework, MRv1, consists of a single JobTracker and one TaskTracker per cluster host. The JobTracker is responsible for scheduling the jobs' component tasks on the TaskTracker hosts, monitoring tasks, and re-executing failed tasks. The TaskTracker hosts execute tasks as directed by the JobTracker.
In the newer framework, MRv2, MapReduce jobs are usually referred to as applications. The work of the MRv1 JobTracker is separated into two parts: resource management and lifecycle management. Resource management, the assignment of applications to underlying compute resources, is performed by a global ResourceManager. Application-layer lifecycle management is performed by an ApplicationMaster. The work that was performed by the one-per-host TaskTracker in MRv1 is performed by the NodeManager on each host in MRv2. Additionally, where the MRv1 JobTracker kept information about past jobs, the ResourceManager and ApplicationMaster do not, so the cluster includes a JobHistory Server to store information about completed applications. For more information about the MRv2 (YARN) runtime framework, see About MapReduce v2 (YARN) in the CDH documentation or Apache Hadoop NextGen MapReduce (YARN) on the Apache Hadoop web site.
Minimally, applications specify the input and output locations and supply map and reduce functions through interfaces and abstract classes. These locations, functions, and other job parameters compose the job configuration. The Hadoop job client then submits the job (executable, and so on) and configuration to the JobTracker (MRv1) or ResourceManager (MRv2), which distributes the software and configuration to the TaskTracker (MRv1) or NodeManager (MRv2) hosts, schedules and monitors tasks, and provides status and diagnostic information to the job client.
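A minimal driver composing such a job configuration might look like the following sketch, which uses the standard `org.apache.hadoop.mapreduce.Job` API. The `WordCountMapper` and `WordCountReducer` classes are assumed user-supplied implementations (not shown), and the code requires the Hadoop libraries and a cluster (or local mode) to actually run.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");   // the job configuration
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);       // assumed user-supplied mapper
        job.setReducerClass(WordCountReducer.class);     // assumed user-supplied reducer
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input location
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output location
        // Submit to the JobTracker (MRv1) or ResourceManager (MRv2) and wait.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The call to `waitForCompletion(true)` is the point at which the job client submits the configuration and executable to the cluster and begins reporting status and diagnostic information.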
The Hadoop framework is implemented in Java. You can develop MapReduce applications in Java or any JVM-based language, or use one of the following interfaces:
- Hadoop Streaming - A utility that allows you to create and run jobs with any executables (for example, shell utilities) as the mapper and/or the reducer.
- Hadoop Pipes - A SWIG-compatible (not based on JNI) C++ API to implement MapReduce applications.
This tutorial describes applications written using the new org.apache.hadoop.mapreduce Java API.