8.2. MapReduce

This tab covers MapReduce settings. Here you can set properties for the JobTracker and TaskTrackers, as well as some general and advanced properties. Click the name of a group to expand or collapse its display.

Table 3.6. MapReduce Settings: JobTracker

Name	Notes
JobTracker host	The host that has been assigned to run the JobTracker. This value is prepopulated based on your choices on previous screens.
JobTracker new generation size	Default size of the Java new generation for JobTracker (Java option -XX:NewSize)
JobTracker maximum new generation size	Maximum size of Java new generation for JobTracker (Java option -XX:MaxNewSize)
JobTracker maximum Java heap size	Maximum Java heap size for JobTracker in MB (Java option -Xmx)
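
The three JobTracker memory fields above map onto standard JVM flags. As a rough illustration only, the sketch below assembles the option string they would produce. The function name `jobtracker_jvm_opts`, the sample sizes, and the assumption that all three values are given in MB are hypothetical; this page documents only the heap size as being in MB.

```python
def jobtracker_jvm_opts(new_size_mb: int, max_new_size_mb: int, max_heap_mb: int) -> str:
    """Build a JVM option string from the three JobTracker memory settings.

    Assumption: every value is in megabytes (only the heap size is
    documented as MB; the unit of the other two is assumed here).
    """
    if new_size_mb > max_new_size_mb:
        raise ValueError("new generation size exceeds its maximum")
    if max_new_size_mb >= max_heap_mb:
        raise ValueError("new generation must be smaller than the heap")
    return (
        f"-XX:NewSize={new_size_mb}m "
        f"-XX:MaxNewSize={max_new_size_mb}m "
        f"-Xmx{max_heap_mb}m"
    )


# Hypothetical sample values, not installer defaults.
print(jobtracker_jvm_opts(200, 200, 1024))
# prints: -XX:NewSize=200m -XX:MaxNewSize=200m -Xmx1024m
```

The two sanity checks reflect the general JVM constraint that the new generation is carved out of the heap; the installer may or may not validate this itself.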

Table 3.7. MapReduce Settings: TaskTracker

Name	Notes
TaskTracker hosts	The hosts that have been assigned to run TaskTrackers. This value is prepopulated based on your choices on previous screens.
MapReduce local directories	Directories for MapReduce to store intermediate data files
Number of Map slots per node	Maximum number of Map tasks that can run simultaneously on a single TaskTracker
Number of Reduce slots per node	Maximum number of Reduce tasks that can run simultaneously on a single TaskTracker
Java options for MapReduce tasks	Java options for the TaskTracker child processes
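
Because slots are configured per node, the number of tasks the whole cluster can run at once is the per-node slot count multiplied by the number of TaskTracker hosts. The sketch below shows that arithmetic. The function name `cluster_task_capacity`, the host names, and the slot counts are hypothetical and chosen only for illustration.

```python
def cluster_task_capacity(tasktracker_hosts, map_slots_per_node, reduce_slots_per_node):
    """Return the cluster-wide number of concurrent Map and Reduce tasks.

    Assumption: every TaskTracker host uses the same per-node slot counts.
    """
    node_count = len(tasktracker_hosts)
    return {
        "map": node_count * map_slots_per_node,
        "reduce": node_count * reduce_slots_per_node,
    }


# Hypothetical three-node cluster with 4 Map slots and 2 Reduce slots per node.
hosts = ["tt1.example.com", "tt2.example.com", "tt3.example.com"]
print(cluster_task_capacity(hosts, map_slots_per_node=4, reduce_slots_per_node=2))
# prints: {'map': 12, 'reduce': 6}
```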

Table 3.8. MapReduce Settings: General

Name	Notes
MapReduce Capacity Scheduler	The scheduler to use for scheduling MapReduce jobs
Cluster's Map slot size (virtual memory)	The virtual memory size of a single Map slot in the MapReduce framework. Use -1 for no limit
Cluster's Reduce slot size (virtual memory)	The virtual memory size of a single Reduce slot in the MapReduce framework. Use -1 for no limit
Upper limit on virtual memory for single Map task	Upper limit on virtual memory for single Map task. Use -1 for no limit.
Upper limit on virtual memory for single Reduce task	Upper limit on virtual memory for single Reduce task. Use -1 for no limit.
Default virtual memory for a job's map-task	Virtual memory for single Map task. Use -1 for no limit.
Default virtual memory for a job's reduce-task	Virtual memory for single Reduce task. Use -1 for no limit.
Map-side sort buffer memory	The total amount of Map-side buffer memory to use while sorting files (Expert-only configuration)
Limit on buffer	Percentage of sort buffer used for record collection (Expert-only configuration)
Job log retention (hours)	The maximum time, in hours, for which user logs are retained after a job completes.
Maximum number tasks for a Job	Maximum number of tasks for a single Job. Use -1 for no limit.
LZO compression	Check to enable LZO compression in addition to Snappy
Snappy compression	Check to enable Snappy compression
Enable Job Diagnostics	Check to enable tools for tracing the path and troubleshooting the performance of MapReduce jobs
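
The slot size, per-task default, and upper-limit settings in this table interact: with a memory-aware scheduler, a task that asks for more virtual memory than one slot provides is generally charged for several slots, and a request above the upper limit is refused. The sketch below illustrates that relationship under stated assumptions. The function name `slots_for_task`, the rounding rule, and the sample values are hypothetical, and the scheduler's actual behavior may differ.

```python
import math

NO_LIMIT = -1  # the "-1 for no limit" convention used in the table


def slots_for_task(task_memory_mb, slot_size_mb, upper_limit_mb):
    """Estimate how many slots one task occupies for a given memory request.

    Assumptions: all values are in MB, and a request larger than one slot
    is rounded up to a whole number of slots.
    """
    if task_memory_mb == NO_LIMIT or slot_size_mb == NO_LIMIT:
        # Memory-based accounting is effectively off: one task, one slot.
        return 1
    if upper_limit_mb != NO_LIMIT and task_memory_mb > upper_limit_mb:
        raise ValueError("task memory request exceeds the configured upper limit")
    return math.ceil(task_memory_mb / slot_size_mb)


# Hypothetical values: 1536 MB slots, 4096 MB upper limit.
print(slots_for_task(1024, 1536, 4096))  # fits in one slot
print(slots_for_task(3072, 1536, 4096))  # needs two slots
```

Under this model, raising the per-task default without also raising the slot size reduces the number of tasks a node can run concurrently.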

Table 3.9. MapReduce Settings: Advanced

Name	Notes
MapReduce system directories	MapReduce system directories
io.sort.record.percent
io.sort.factor
mapred.tasktracker.tasks.sleeptime-before-sigkill	The time, in milliseconds, that the TaskTracker waits before forcibly killing a task process. The normally recommended value is 5000 (5 seconds). Here the setting is used only to terminate tasks very quickly (0.25 second), to guarantee that no task JVMs are left around for later jobs.
mapred.job.tracker.handler.count	The number of server threads for the JobTracker. This should be roughly 4% of the number of TaskTracker nodes.
mapreduce.cluster.administrators
mapred.reduce.parallel.copies
tasktracker.http.threads
mapred.map.tasks.speculative.execution	If `true`, then multiple instances of some map tasks may be executed in parallel
mapred.reduce.tasks.speculative.execution	If `true`, then multiple instances of some reduce tasks may be executed in parallel
mapred.reduce.slowstart.completed.maps
mapred.inmem.merge.threshold	The threshold, in terms of the number of files, for triggering the in-memory merge process. When the threshold is hit, we initiate the merge and spill to disk. A value of less than or equal to 0 means no threshold is set and ramfs's memory consumption triggers the merge.
mapred.job.shuffle.merge.percent	The threshold, expressed as a percentage of the total memory allocated to storing in-memory map outputs (defined in `mapred.job.shuffle.input.buffer.percent`), for triggering the in-memory merge process.
mapred.job.shuffle.input.buffer.percent	The percentage of memory to be allocated from the maximum heap size for storing map outputs during the shuffle.
mapred.output.compression.type	If the job outputs are to be compressed as SequenceFiles, how should they be compressed? Acceptable values are: NONE, RECORD, or BLOCK.
mapred.jobtracker.completeuserjobs.maximum
mapred.jobtracker.restart.recover	A value of `true` enables job recovery on restart; `false` starts afresh
mapred.job.reduce.input.buffer.percent	The percentage of memory relative to the maximum heap size. When the shuffle is concluded, any remaining map outputs in memory must consume less than this threshold before the reduce can begin.
mapreduce.reduce.input.limit	The limit on the input size of the reduce. If the estimated input size of the reduce is greater than this value, job is failed. A value of -1 means that no limit is set.
mapred.task.timeout	The number of milliseconds before a task is terminated if it neither reads an input, writes an output, nor updates its status string.
jetty.connector
mapred.child.root.logger
mapred.max.tracker.blacklists	If a node is reported blacklisted by this number of successful jobs within the timeout window, it will be graylisted.
mapred.healthChecker.interval
mapred.healthChecker.script.timeout
mapred.job.tracker.persist.jobstatus.active	Indicates if persistency of job status is active or not
mapred.job.tracker.persist.jobstatus.hours	The number of hours job status information is persisted in DFS. Job status information is available after it drops off the memory queue and between JobTracker restarts. A value of zero means that job status information is not persisted at all.
mapred.jobtracker.retirejob.check
mapred.jobtracker.retirejob.interval
mapred.job.tracker.history.completed.location
mapreduce.fileoutputcommitter.marksuccessfuljobs
mapred.job.reuse.jvm.num.tasks	The number of tasks to run per JVM. A value of -1 indicates no limit.
hadoop.job.history.user.location
mapreduce.jobtracker.staging.root.dir	The path prefix for the staging directories. The next level is always the user's name. It is a path in the default file system.
mapreduce.tasktracker.group	The group that the TaskTracker controller uses for accessing the controller. The mapred user must be a member and users should not be members.
mapreduce.jobtracker.split.metainfo.maxsize	If the size of the split metainfo file is larger than this value, the JobTracker will fail the job during initialization.
mapred.jobtracker.blacklist.fault-timeout-window	The sliding window, in minutes, over which TaskTracker faults are counted
mapred.jobtracker.blacklist.fault-bucket-width	The width, in minutes, of each bucket within the fault timeout window (15-minute buckets)
mapred.queue.names	Comma-separated list of queues configured for this JobTracker
Custom MapReduce Configs	Use this text box to enter values for mapred-site.xml properties not exposed by the UI. Enter in "key=value" format, with a newline as a delimiter between pairs.
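
To make the "key=value" format of the Custom MapReduce Configs box concrete, the sketch below parses newline-delimited pairs and renders them as the property elements used in mapred-site.xml. It is an illustration only: the function names `parse_custom_configs` and `to_site_xml`, the whitespace trimming, and the sample values are assumptions, not a description of how the installer processes the text box.

```python
import xml.etree.ElementTree as ET


def parse_custom_configs(text):
    """Parse newline-delimited key=value pairs into a dict.

    Blank lines are skipped; surrounding whitespace is trimmed (assumed
    behavior). Only the first '=' splits the pair, so values may contain '='.
    """
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, sep, value = line.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"expected key=value, got: {line!r}")
        props[key.strip()] = value.strip()
    return props


def to_site_xml(props):
    """Render properties as a Hadoop-style <configuration> document."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample input; the values are not recommendations.
sample = "mapred.task.timeout=600000\nmapred.reduce.parallel.copies=30\n"
print(to_site_xml(parse_custom_configs(sample)))
```

Splitting on the first "=" only keeps values such as Java option strings intact when they contain their own equals signs.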
