Using Snappy for MapReduce Compression
Enabling MapReduce intermediate compression is very common, because it can make jobs run faster without any application changes. Only the temporary intermediate files that Hadoop creates for the shuffle phase are compressed; the final output may or may not be compressed. Snappy is well suited to this case because it compresses and decompresses much faster than algorithms such as Gzip. For information about choosing a compression format, see Choosing a Data Compression Format.
To enable Snappy for MapReduce intermediate compression for the whole cluster, set the following properties in mapred-site.xml:
- For MRv1:
  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
  </property>
  <property>
    <name>mapred.map.output.compression.codec</name>
    <value>org.apache.hadoop.io.compress.SnappyCodec</value>
  </property>
- For YARN:
  <property>
    <name>mapreduce.map.output.compress</name>
    <value>true</value>
  </property>
  <property>
    <name>mapreduce.map.output.compress.codec</name>
    <value>org.apache.hadoop.io.compress.SnappyCodec</value>
  </property>
You can also set these properties on a per-job basis.
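One way to set the properties for a single job is Hadoop's generic `-D name=value` command-line option, which is honored only when the job's driver parses generic options (for example, by running through `ToolRunner`). The following is a minimal sketch, using only the Java standard library, that assembles those options for the YARN property names. The class `SnappyShuffleArgs` and its methods are hypothetical names chosen for illustration and are not part of Hadoop.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of Hadoop): builds "-D name=value" generic
// options that enable Snappy for intermediate (map output) compression
// on a single YARN job.
class SnappyShuffleArgs {

    // YARN (MRv2) property names for intermediate compression.
    static Map<String, String> properties() {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("mapreduce.map.output.compress", "true");
        props.put("mapreduce.map.output.compress.codec",
                  "org.apache.hadoop.io.compress.SnappyCodec");
        return props;
    }

    // Render each property as the two tokens "-D" and "name=value".
    static List<String> build() {
        List<String> args = new ArrayList<>();
        for (Map.Entry<String, String> e : properties().entrySet()) {
            args.add("-D");
            args.add(e.getKey() + "=" + e.getValue());
        }
        return args;
    }

    public static void main(String[] unused) {
        // Print the options on one line for use on the command line.
        System.out.println(String.join(" ", build()));
    }
}
```

The printed options would go after the driver class name in a `hadoop jar` invocation and before the job's own arguments. For MRv1, the same approach applies with the MRv1 property names substituted.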
Use the properties in the following table to compress the final output of a MapReduce job. These are usually set on a per-job basis.
MRv1 Property | YARN Property | Description |
---|---|---|
mapred.output.compress | mapreduce.output.fileoutputformat.compress | Whether to compress the final job outputs (true or false) |
mapred.output.compression.codec | mapreduce.output.fileoutputformat.compress.codec | If the final job outputs are to be compressed, which codec should be used. Set to org.apache.hadoop.io.compress.SnappyCodec for Snappy compression. |
mapred.output.compression.type | mapreduce.output.fileoutputformat.compress.type | For SequenceFile outputs, what type of compression should be used (NONE, RECORD, or BLOCK). BLOCK is recommended. |
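
Because the final-output properties are usually set per job, they can be passed as generic `-D name=value` options when the driver parses them (for example, through `ToolRunner`). The sketch below, which uses only the Java standard library, builds those options for the YARN property names in the table. `SnappyOutputArgs` and its nested `CompressionType` enum are hypothetical names used for illustration, not Hadoop classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Hadoop): builds "-D name=value" generic
// options that request Snappy-compressed final output on a YARN job.
class SnappyOutputArgs {

    // Allowed SequenceFile compression types, as listed in the table.
    enum CompressionType { NONE, RECORD, BLOCK }

    static List<String> build(CompressionType type) {
        List<String> args = new ArrayList<>();
        add(args, "mapreduce.output.fileoutputformat.compress", "true");
        add(args, "mapreduce.output.fileoutputformat.compress.codec",
            "org.apache.hadoop.io.compress.SnappyCodec");
        // Applies to SequenceFile outputs; BLOCK is the recommended value.
        add(args, "mapreduce.output.fileoutputformat.compress.type", type.name());
        return args;
    }

    // Append one property as the two tokens "-D" and "name=value".
    private static void add(List<String> args, String name, String value) {
        args.add("-D");
        args.add(name + "=" + value);
    }

    public static void main(String[] unused) {
        System.out.println(String.join(" ", build(CompressionType.BLOCK)));
    }
}
```

Modeling the compression type as an enum keeps the value restricted to the three options the table documents, so a typo fails at compile time rather than at job submission.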