Using Snappy for MapReduce Compression
It's very common to enable MapReduce intermediate compression, since this can make jobs run faster without requiring any application changes. Only the temporary intermediate files that Hadoop creates for the shuffle phase are compressed (the final output may or may not be compressed). Snappy is ideal in this case because it compresses and decompresses very quickly compared to other compression algorithms, such as Gzip.
To enable Snappy for MapReduce intermediate compression for the whole cluster, set the following properties in mapred-site.xml:
- For MRv1:
```xml
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
```
- For YARN:
```xml
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
```
You can also set these properties on a per-job basis.
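For example, the per-job equivalent can be passed on the command line with `-D` options (a sketch; `myjob.jar` and `MyDriver` are placeholder names, and the driver must support Hadoop's generic options, e.g. by implementing `Tool`):

```shell
# Enable Snappy compression of map (intermediate) output for this job only
hadoop jar myjob.jar MyDriver \
  -Dmapreduce.map.output.compress=true \
  -Dmapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec \
  input output
```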
Use the properties in the following table to compress the final output of a MapReduce job. These are usually set on a per-job basis.
| MRv1 Property | YARN Property | Description |
|---|---|---|
| `mapred.output.compress` | `mapreduce.output.fileoutputformat.compress` | Whether to compress the final job outputs (`true` or `false`). |
| `mapred.output.compression.codec` | `mapreduce.output.fileoutputformat.compress.codec` | If the final job outputs are to be compressed, the codec to use. Set to `org.apache.hadoop.io.compress.SnappyCodec` for Snappy compression. |
| `mapred.output.compression.type` | `mapreduce.output.fileoutputformat.compress.type` | For SequenceFile outputs, the type of compression to use (`NONE`, `RECORD`, or `BLOCK`). `BLOCK` is recommended. |
The MRv1 property names are also supported (though deprecated) in MRv2 (YARN), so it's not mandatory to update them in this release.
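The final-output properties in the table can also be set programmatically in a job driver instead of via `-D` options. A minimal sketch, assuming the new (`org.apache.hadoop.mapreduce`) API and the Hadoop libraries on the classpath; the class name `SnappyOutputConfig` is a placeholder:

```java
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class SnappyOutputConfig {
  public static void configure(Job job) {
    // Equivalent to mapreduce.output.fileoutputformat.compress=true
    FileOutputFormat.setCompressOutput(job, true);
    // Equivalent to mapreduce.output.fileoutputformat.compress.codec
    FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
    // For SequenceFile outputs: mapreduce.output.fileoutputformat.compress.type
    SequenceFileOutputFormat.setOutputCompressionType(job,
        SequenceFile.CompressionType.BLOCK);
  }
}
```

These helpers simply set the corresponding properties on the job's `Configuration`, so they take effect for that job only.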