3.1.5. Changes to the Shuffle Service

As in Hadoop Version 1, the shuffle is required for parallel MapReduce job operation. Reducers fetch the outputs of all the maps by “shuffling” map-output data from the nodes where the map tasks ran. The MapReduce Shuffle functionality is implemented as an Auxiliary Service in the Node Manager. This service starts a Netty web server in the Node Manager address space and handles MapReduce-specific shuffle requests from reduce tasks. Whenever it starts a Container, the MapReduce Application Master specifies the service ID of the shuffle service, along with any security tokens that may be required. In its response, the Node Manager provides the Application Master with the port on which the shuffle service is running, and the Application Master passes that port on to the reduce tasks.
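The exchange described above can be modeled in a few lines. The following is only a toy sketch of the handshake, not Hadoop code: the class and method names (ShuffleHandshakeSketch, startContainer, shufflePortFrom) are hypothetical, the port value is illustrative, and the real Node Manager and Application Master exchange this information through YARN's container-launch protocol. The service ID "mapreduce_shuffle" is the name under which the MapReduce shuffle is conventionally registered.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Toy model of the shuffle-port handshake. None of these types are Hadoop classes.
final class ShuffleHandshakeSketch {
    // Service ID under which the MapReduce shuffle is conventionally registered.
    static final String SHUFFLE_SERVICE_ID = "mapreduce_shuffle";

    // Stand-in for the Node Manager: if the launch request names the shuffle
    // service, answer with the shuffle port encoded as that service's metadata.
    static Map<String, ByteBuffer> startContainer(Map<String, ByteBuffer> serviceData,
                                                  int shufflePort) {
        Map<String, ByteBuffer> metaData = new HashMap<>();
        if (serviceData.containsKey(SHUFFLE_SERVICE_ID)) {
            ByteBuffer port = ByteBuffer.allocate(Integer.BYTES);
            port.putInt(shufflePort);
            port.flip(); // make the buffer readable from position 0
            metaData.put(SHUFFLE_SERVICE_ID, port);
        }
        return metaData;
    }

    // Stand-in for the Application Master: decode the port from the response
    // so that it can be handed to the reduce tasks.
    static int shufflePortFrom(Map<String, ByteBuffer> metaData) {
        ByteBuffer port = metaData.get(SHUFFLE_SERVICE_ID);
        if (port == null) {
            throw new IllegalStateException("shuffle service not available on this node");
        }
        return port.duplicate().getInt(); // duplicate() leaves the stored buffer untouched
    }

    public static void main(String[] args) {
        // The Application Master names the service and attaches a placeholder token.
        Map<String, ByteBuffer> serviceData = new HashMap<>();
        serviceData.put(SHUFFLE_SERVICE_ID,
                ByteBuffer.wrap("placeholder-token".getBytes(StandardCharsets.UTF_8)));

        // 13562 is an illustrative port value chosen for this sketch.
        Map<String, ByteBuffer> response = startContainer(serviceData, 13562);
        System.out.println("Reducers should fetch from port " + shufflePortFrom(response));
    }
}
```

The sketch keys both the request and the response by service ID. That mirrors the description above: the Node Manager can host several auxiliary services, and each one only sees and answers the data addressed to it.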

Hadoop Version 2 also offers an optional Encrypted Shuffle, which allows the shuffle to use HTTPS with optional client authentication. The feature is implemented with a toggle between HTTP and HTTPS, keystore/truststore properties, and the distribution of these stores to new or existing nodes. For details about the multi-step configuration of Encrypted Shuffle, review the Encrypted Shuffle documentation on the Apache Hadoop website.
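To give a sense of what that configuration involves, the sketch below collects the main settings and renders them as Hadoop-style property elements. It is a hedged illustration only: the class EncryptedShuffleConfigSketch and its methods are hypothetical, and the property names and values are given as recalled from the Encrypted Shuffle documentation, so they should be checked against the documentation for the Hadoop release in use. The keystore and truststore locations themselves live in the separate files named by the last two core-site properties and are not shown.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: gathers Encrypted Shuffle settings and prints them as
// <property> elements. Verify every name against your Hadoop release.
final class EncryptedShuffleConfigSketch {
    // Settings that would go in mapred-site.xml: the HTTP/HTTPS toggle.
    static Map<String, String> mapredSite() {
        Map<String, String> p = new LinkedHashMap<>();
        p.put("mapreduce.shuffle.ssl.enabled", "true");
        return p;
    }

    // Settings that would go in core-site.xml: client authentication and the
    // files that hold the keystore/truststore details.
    static Map<String, String> coreSite() {
        Map<String, String> p = new LinkedHashMap<>();
        p.put("hadoop.ssl.require.client.cert", "false"); // client authentication is optional
        p.put("hadoop.ssl.server.conf", "ssl-server.xml");
        p.put("hadoop.ssl.client.conf", "ssl-client.xml");
        return p;
    }

    // Render a property map in the layout used by Hadoop *-site.xml files.
    static String toXml(Map<String, String> properties) {
        StringBuilder xml = new StringBuilder("<configuration>\n");
        for (Map.Entry<String, String> e : properties.entrySet()) {
            xml.append("  <property>\n")
               .append("    <name>").append(e.getKey()).append("</name>\n")
               .append("    <value>").append(e.getValue()).append("</value>\n")
               .append("  </property>\n");
        }
        return xml.append("</configuration>\n").toString();
    }

    public static void main(String[] args) {
        System.out.println("mapred-site.xml fragment:\n" + toXml(mapredSite()));
        System.out.println("core-site.xml fragment:\n" + toXml(coreSite()));
    }
}
```

Splitting the settings into two maps reflects the fact that the toggle and the SSL properties belong to different configuration files, which is part of why the documentation describes the setup as a multi-step process.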