You can configure HDFS compression using GzipCodec. Gzip is CPU-intensive but provides a high compression ratio, which makes it a good choice for long-term storage.
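The trade-off above (more CPU for a smaller footprint) is easy to see outside Hadoop as well. A minimal sketch using Python's built-in gzip module, which wraps the same DEFLATE algorithm that GzipCodec uses; the sample data and sizes here are illustrative, not Hadoop defaults:

```python
import gzip

# Repetitive data, such as log lines or sorted keys, compresses very well;
# the cost is the CPU time spent at the higher compression levels.
data = b"HDFS block data " * 1024          # 16 KiB of repetitive input
compressed = gzip.compress(data, compresslevel=9)

ratio = len(data) / len(compressed)
print(f"original: {len(data)} bytes, "
      f"gzipped: {len(compressed)} bytes, "
      f"ratio: {ratio:.0f}x")
```

Lower `compresslevel` values trade ratio for speed, which mirrors the choice between Gzip and lighter codecs such as DefaultCodec.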
To configure Gzip compression for a one-time job, execute the following command. This does not require a cluster restart.
hadoop jar hadoop-examples-1.1.0-SNAPSHOT.jar sort \
  "-Dmapred.compress.map.output=true" \
  "-Dmapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec" \
  "-Dmapred.output.compress=true" \
  "-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec" \
  -outKey org.apache.hadoop.io.Text \
  -outValue org.apache.hadoop.io.Text \
  input output
To configure Gzip as the default compression, edit your core-site.xml and mapred-site.xml configuration files as follows:
core-site.xml

<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
  <description>A list of the compression codec classes that can be used for compression/decompression.</description>
</property>

mapred-site.xml

<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>
<property>
  <name>mapred.output.compression.type</name>
  <value>BLOCK</value>
</property>
<!-- Enable the following two configs if you want to turn on job output compression. This is generally not done. -->
<property>
  <name>mapred.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapred.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>
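After editing, it can be worth sanity-checking that the expected property values actually ended up in the file. A minimal sketch, assuming the properties sit inside the usual `<configuration>` root element of a Hadoop *-site.xml file; the `get_property` helper is illustrative and not part of any Hadoop tooling:

```python
import xml.etree.ElementTree as ET

# An inline fragment standing in for a mapred-site.xml file; in practice
# you would read the file from the Hadoop configuration directory.
SAMPLE = """<configuration>
  <property>
    <name>mapred.map.output.compression.codec</name>
    <value>org.apache.hadoop.io.compress.GzipCodec</value>
  </property>
  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
  </property>
</configuration>"""

def get_property(xml_text, name):
    """Return the <value> for the given property <name>, or None if absent."""
    root = ET.fromstring(xml_text)
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None

print(get_property(SAMPLE, "mapred.map.output.compression.codec"))
```

A check like this catches the easy-to-miss mistakes in hand-edited XML, such as a typo in the codec class name or a property pasted outside the `<configuration>` element.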