Major changes when migrating to MapReduce 2
Reviewing the differences between MapReduce version 1 (MRv1) and MapReduce version 2 (MRv2) helps you understand which capabilities and services have changed and what replaces the deprecated ones.
Architectural changes
In MRv2, the responsibilities of the MapReduce framework are split into two components:
- YARN (Yet Another Resource Negotiator): cluster resource management capabilities
- MapReduce: MapReduce-specific capabilities
In the MapReduce version 1 (MRv1) architecture, the cluster is managed by a service called the JobTracker. TaskTracker services run on each host and launch tasks on behalf of jobs. The JobTracker also serves information about completed jobs.
In MapReduce version 2 (MRv2), the functions of the JobTracker are split among four services:
- ResourceManager
- ApplicationMaster
- JobHistory Server
- NodeManager
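To run jobs under this architecture, a cluster selects YARN as the MapReduce execution framework. A minimal sketch of the relevant mapred-site.xml entries, assuming a hypothetical host historyserver.example.com for the JobHistory Server (the hostname is illustrative; 10020 is the conventional default port):

  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>  <!-- run MapReduce jobs on YARN instead of the MRv1 JobTracker -->
  </property>
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>historyserver.example.com:10020</value>  <!-- illustrative host -->
  </property>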
Configuration options with new names
Many configuration options have new names to reflect the architectural shift. Because JobTrackers and TaskTrackers no longer exist in MRv2, all configuration options pertaining to them are gone as well, although many of them have corresponding options for the ResourceManager, NodeManager, and JobHistory Server.
The vast majority of job configuration options that were available in MRv1 work in MRv2 as well. For consistency and clarity, many options have been given new names. The older names are deprecated, but will still work for the time being. The exceptions to this are mapred.child.ulimit and all options relating to JVM reuse, as these are no longer supported.
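For example, a few commonly used job options and their MRv2 names, taken from the standard deprecation mapping (both spellings currently work):

  mapred.job.name         ->  mapreduce.job.name
  mapred.map.tasks        ->  mapreduce.job.maps
  mapred.reduce.tasks     ->  mapreduce.job.reduces
  mapred.output.compress  ->  mapreduce.output.fileoutputformat.compress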
Managing resources
One of the larger changes in MRv2 is the way that resources are managed. In MRv1, each host was configured with a fixed number of map slots and a fixed number of reduce slots. Under YARN, there is no distinction between resources available for maps and resources available for reduces; all resources are available for both.
The notion of slots has been discarded, and resources are now configured in terms of amounts of memory (in megabytes) and CPU (in “virtual cores”, which are described below).
In MRv1, the mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum properties dictated how many map and reduce slots each TaskTracker had. These properties no longer exist in YARN. Instead, YARN uses yarn.nodemanager.resource.memory-mb and yarn.nodemanager.resource.cpu-vcores, which control the amount of memory and CPU on each host, both available to both maps and reduces.
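For illustration, a minimal yarn-site.xml sketch for a host offering 8 GB of memory and 8 virtual cores to containers (the values are illustrative and should be sized to the hardware left over after the operating system and other daemons):

  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>8192</value>  <!-- memory, in MB, this NodeManager can allocate to containers -->
  </property>
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>8</value>  <!-- virtual cores this NodeManager can allocate to containers -->
  </property>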
Administration commands
The jobtracker and tasktracker commands, which start the JobTracker and TaskTracker, are no longer supported because these services no longer exist. They are replaced with yarn resourcemanager and yarn nodemanager, which start the ResourceManager and NodeManager respectively. hadoop mradmin is no longer supported. Instead, yarn rmadmin can be used.
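As a sketch, the old-to-new mapping looks like this (in practice the daemon commands are usually wrapped in service or yarn-daemon.sh scripts rather than run in the foreground):

  # MRv1 (no longer supported)
  hadoop jobtracker
  hadoop tasktracker
  hadoop mradmin -refreshQueues

  # MRv2 equivalents
  yarn resourcemanager
  yarn nodemanager
  yarn rmadmin -refreshQueues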
Secure cluster setup
As in MRv1, a configuration must be set to have the user that submits a job own its task processes. The equivalent of the MRv1 LinuxTaskController is the LinuxContainerExecutor. In a secure setup, NodeManager configurations should set yarn.nodemanager.container-executor.class to org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor. Properties set in the taskcontroller.cfg configuration file should be transitioned to their analogous properties in the container-executor.cfg file.
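A sketch of the corresponding settings follows; the group name hadoop and the user lists are illustrative, and the group must match the group of the setuid container-executor binary.

In yarn-site.xml:

  <property>
    <name>yarn.nodemanager.container-executor.class</name>
    <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
  </property>
  <property>
    <name>yarn.nodemanager.linux-container-executor.group</name>
    <value>hadoop</value>  <!-- illustrative -->
  </property>

In container-executor.cfg:

  yarn.nodemanager.linux-container-executor.group=hadoop
  banned.users=hdfs,yarn,mapred
  min.user.id=1000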
Queue access control lists (ACLs) are now placed in the Capacity Scheduler configuration file instead of the JobTracker configuration.
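For example, a submit ACL for a queue named default would now be expressed in capacity-scheduler.xml along these lines (the queue name and the user/group lists are illustrative; the value format is a comma-separated user list, a space, then a comma-separated group list):

  <property>
    <name>yarn.scheduler.capacity.root.default.acl_submit_applications</name>
    <value>alice,bob developers</value>
  </property>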
High Availability
- The failover controller has moved from a separate ZKFC daemon into the ResourceManager itself, so there is no need to run an additional daemon (a configuration sketch follows the list).
- Clients, applications, and NodeManagers do not require configuring a proxy-provider to talk to the active ResourceManager.
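A minimal sketch of the yarn-site.xml properties involved in enabling ResourceManager high availability (the cluster ID, ResourceManager IDs, and hostnames are illustrative):

  <property>
    <name>yarn.resourcemanager.ha.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.resourcemanager.cluster-id</name>
    <value>mycluster</value>  <!-- illustrative -->
  </property>
  <property>
    <name>yarn.resourcemanager.ha.rm-ids</name>
    <value>rm1,rm2</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname.rm1</name>
    <value>rm1.example.com</value>  <!-- illustrative hosts -->
  </property>
  <property>
    <name>yarn.resourcemanager.hostname.rm2</name>
    <value>rm2.example.com</value>
  </property>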