Job Input

InputFormat describes the input specification for a MapReduce job. The MapReduce framework relies on the InputFormat of the job to:

  1. Validate the input specification of the job.
  2. Split the input file(s) into logical InputSplit instances, each of which is then assigned to an individual Mapper.
  3. Provide the RecordReader implementation used to collect input records from the logical InputSplit for processing by the Mapper.

By default, file-based InputFormat implementations, which are typically subclasses of FileInputFormat, split the input into logical InputSplit instances based on the total size, in bytes, of the input files. The FileSystem block size of the input files is treated as an upper bound on the size of each input split. You can set a lower bound on the split size with the mapreduce.input.fileinputformat.split.minsize property.
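The interaction between the two bounds can be sketched in plain Java. This is a minimal illustration, not the framework's code: it assumes the split size is the clamp max(minSize, min(maxSize, blockSize)), where maxSize stands for an optional maximum split size (believed to correspond to mapreduce.input.fileinputformat.split.maxsize, which the text above does not mention). The class and method names are hypothetical, and the split count ignores any slack the framework may allow for the final split.

```java
// Hypothetical stand-in for the split-size logic; not a Hadoop class.
final class SplitSizeSketch {

    // Assumed clamp: the block size caps the split unless the
    // configured minimum is larger, in which case the minimum wins.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Number of splits for one file, using ceiling division.
    // Simplification: no slack allowance for a short final split.
    static long countSplits(long fileLength, long splitSize) {
        return (fileLength + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long blockSize = 128 * mb;
        long fileLength = 1024 * mb;

        // No lower bound configured: the block size determines the split.
        long defaultSplit = computeSplitSize(blockSize, 1L, Long.MAX_VALUE);
        // Lower bound raised above the block size: the lower bound determines it.
        long raisedSplit = computeSplitSize(blockSize, 256 * mb, Long.MAX_VALUE);

        System.out.println(countSplits(fileLength, defaultSplit) + " splits by default");
        System.out.println(countSplits(fileLength, raisedSplit) + " splits with a 256 MB minimum");
    }
}
```

Under these assumptions, raising the minimum above the block size produces fewer, larger splits, so the block size acts as an upper bound only while the configured minimum does not exceed it.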

Logical splits based purely on input size are insufficient for many applications, because a split computed from byte offsets can begin or end in the middle of a record, and record boundaries must be respected. In such cases, the application should implement a RecordReader, which is responsible for respecting record boundaries and presenting a record-oriented view of the logical InputSplit to the individual task.
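One common way to respect record boundaries for newline-delimited records is sketched below in plain Java. It is an illustration under stated assumptions, not the framework's implementation: a reader whose split does not start at byte 0 discards everything up to and including the first newline, and every reader keeps reading whole lines as long as the line starts at or before the end of its split. The class and method names are hypothetical, and the input is an in-memory byte array rather than a file.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical line-oriented reader over a byte range; not a Hadoop class.
final class LineSplitSketch {

    // Returns the lines owned by the split [start, start + length).
    static List<String> readLines(byte[] data, int start, int length) {
        int end = start + length;
        int pos = start;
        if (start != 0) {
            // The first (possibly partial) line belongs to the previous split.
            pos = nextLineStart(data, pos);
        }
        List<String> records = new ArrayList<>();
        // A line that starts at or before 'end' is read in full,
        // even if it runs past the end of the split.
        while (pos <= end && pos < data.length) {
            int next = nextLineStart(data, pos);
            boolean hasNewline = next > pos && data[next - 1] == '\n';
            int lineEnd = hasNewline ? next - 1 : next;
            records.add(new String(data, pos, lineEnd - pos, StandardCharsets.UTF_8));
            pos = next;
        }
        return records;
    }

    // Index of the byte following the next newline, or data.length at end of input.
    static int nextLineStart(byte[] data, int from) {
        int i = from;
        while (i < data.length && data[i] != '\n') {
            i++;
        }
        return Math.min(i + 1, data.length);
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbravo\ncharlie\n".getBytes(StandardCharsets.UTF_8);
        // The boundary at byte 8 falls in the middle of "bravo".
        System.out.println(readLines(data, 0, 8));
        System.out.println(readLines(data, 8, 12));
    }
}
```

With this convention, each line is produced by exactly one split, regardless of where the byte boundary falls.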

TextInputFormat is the default InputFormat.

If TextInputFormat is the InputFormat for a given job, the framework detects input files with .gz extensions and automatically decompresses them using the appropriate CompressionCodec. However, compressed files with the .gz extension cannot be split; each compressed file is processed in its entirety by a single mapper.
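The effect on split planning can be sketched as follows. This is a simplified plain-Java model, not the framework's logic: it decides purely on the .gz file name suffix described above, and the class name, method name, and the representation of a split as an {offset, length} pair are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical split planner; not a Hadoop class.
final class GzipSplitSketch {

    // Returns {offset, length} pairs describing the splits for one file.
    static List<long[]> planSplits(String fileName, long fileLength, long splitSize) {
        List<long[]> splits = new ArrayList<>();
        if (fileName.endsWith(".gz")) {
            // A gzip file cannot be split, so a single split covers
            // the whole file and one mapper processes all of it.
            splits.add(new long[] {0L, fileLength});
            return splits;
        }
        // Uncompressed input: cut the file into size-based byte ranges.
        for (long offset = 0; offset < fileLength; offset += splitSize) {
            splits.add(new long[] {offset, Math.min(splitSize, fileLength - offset)});
        }
        return splits;
    }

    public static void main(String[] args) {
        System.out.println(planSplits("events.txt", 300L, 128L).size() + " split(s) for events.txt");
        System.out.println(planSplits("events.txt.gz", 300L, 128L).size() + " split(s) for events.txt.gz");
    }
}
```

In practice this means that a single very large .gz input limits the parallelism of the map phase to one task for that file.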

InputSplit

InputSplit represents the data to be processed by an individual mapper. Typically, InputSplit presents a byte-oriented view of the input; the RecordReader processes and presents a record-oriented view.

FileSplit is the default InputSplit. It sets mapreduce.map.input.file to the path of the input file for the logical split.

RecordReader

RecordReader reads <key, value> pairs from an InputSplit. Typically, RecordReader converts the byte-oriented view of the input provided by the InputSplit into a record-oriented view for Mapper implementations to process. RecordReader is therefore responsible for handling record boundaries and for presenting the tasks with keys and values.
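The iteration pattern this implies can be sketched with a plain-Java stand-in. The method names nextKeyValue, getCurrentKey, and getCurrentValue follow those of Hadoop's RecordReader class, but the class below is hypothetical and does not extend it. It also assumes line-oriented input in which the key is the byte offset of the line and the value is the line's text, and it reads from an in-memory byte array instead of an InputSplit.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical reader that turns bytes into <offset, line> pairs; not a Hadoop class.
final class LineReaderSketch {
    private final byte[] data;
    private int pos = 0;
    private long key;
    private String value;

    LineReaderSketch(byte[] data) {
        this.data = data;
    }

    // Advances to the next record; returns false when the input is exhausted.
    boolean nextKeyValue() {
        if (pos >= data.length) {
            return false;
        }
        int lineEnd = pos;
        while (lineEnd < data.length && data[lineEnd] != '\n') {
            lineEnd++;
        }
        key = pos; // byte offset at which the line starts
        value = new String(data, pos, lineEnd - pos, StandardCharsets.UTF_8);
        pos = Math.min(lineEnd + 1, data.length); // step past the newline
        return true;
    }

    long getCurrentKey() {
        return key;
    }

    String getCurrentValue() {
        return value;
    }

    public static void main(String[] args) {
        LineReaderSketch reader =
                new LineReaderSketch("a,1\nb,22\n".getBytes(StandardCharsets.UTF_8));
        // The caller sees records only; byte handling stays inside the reader.
        while (reader.nextKeyValue()) {
            System.out.println(reader.getCurrentKey() + " -> " + reader.getCurrentValue());
        }
    }
}
```

The consuming loop never inspects raw bytes, which is the separation of concerns the section describes: the split identifies a byte range, and the reader turns it into keys and values.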