10. Configuring HDFS Compression using GZIP

You can configure HDFS compression using GzipCodec. Gzip is CPU-intensive, but it provides a high compression ratio and is recommended for long-term storage.

To configure Gzip compression for a one-time job, run the following command. This does not require a cluster restart.

hadoop jar hadoop-examples-1.1.0-SNAPSHOT.jar sort \
"-Dmapred.compress.map.output=true" \
"-Dmapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec" \
"-Dmapred.output.compress=true" \
"-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec" \
-outKey org.apache.hadoop.io.Text -outValue org.apache.hadoop.io.Text input output
    
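If you submit your own MapReduce job rather than the bundled example, the same per-job settings can also be applied programmatically. The following sketch uses the old org.apache.hadoop.mapred API to match the mapred.* property names above; the class name, job name, and input/output arguments are placeholders, and a real job would also configure its own mapper, reducer, and key/value types.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class GzipJobExample {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(GzipJobExample.class);
        conf.setJobName("gzip-compression-example");   // placeholder job name

        // Compress intermediate map output with Gzip
        // (same effect as -Dmapred.compress.map.output=true and
        //  -Dmapred.map.output.compression.codec=...GzipCodec above)
        conf.setCompressMapOutput(true);
        conf.setMapOutputCompressorClass(GzipCodec.class);

        // Compress the final job output with Gzip
        // (same effect as -Dmapred.output.compress=true and
        //  -Dmapred.output.compression.codec=...GzipCodec above)
        FileOutputFormat.setCompressOutput(conf, true);
        FileOutputFormat.setOutputCompressorClass(conf, GzipCodec.class);

        // Mapper and reducer default to the identity classes here;
        // a real job would set its own along with key/value types.
        FileInputFormat.setInputPaths(conf, new Path(args[0]));   // input path
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // output path

        JobClient.runJob(conf);
    }
}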

To configure Gzip as the default compression codec, edit your core-site.xml and mapred-site.xml configuration files as follows:

core-site.xml
<property>
<name>io.compression.codecs</name>
<value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
<description>A list of the compression codec classes that can be used
for compression/decompression.</description>
</property>
mapred-site.xml
<property>
<name>mapred.compress.map.output</name>
<value>true</value>
</property>
<property>
<name>mapred.map.output.compression.codec</name>
<value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>
<property>
<name>mapred.output.compression.type</name>
<value>BLOCK</value>
</property>
<!-- Enable the following two properties if you want to turn on job output compression. This is generally not done. -->
<property>
<name>mapred.output.compress</name>
<value>true</value>
</property>
<property>
<name>mapred.output.compression.codec</name>
<value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>
    
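Once GzipCodec is listed in io.compression.codecs, client code can also use it directly to write .gz files into HDFS. The following sketch is a minimal, illustrative example rather than part of the configuration steps above; the output path and sample data are placeholders.

import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class GzipWriteExample {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml (including io.compression.codecs) from the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Resolve the codec from the file extension; .gz maps to GzipCodec.
        Path out = new Path("/tmp/example.gz");    // placeholder path
        CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(out);

        // Wrap the raw HDFS output stream with the codec's compressing stream.
        try (OutputStream gzipped = codec.createOutputStream(fs.create(out))) {
            gzipped.write("hello, compressed HDFS\n".getBytes(StandardCharsets.UTF_8));
        }
    }
}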
