This document is intended for system administrators who need to configure HDFS compression on Windows platform.
Windows supports GzipCodec, DefaultCodec, and BZip2Codec. GzipCodec is the most commonly used codec for HDFS compression.
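GzipCodec writes standard gzip streams, so files it produces can be read by ordinary gzip tools off the cluster. A minimal stdlib round trip illustrating the format (for illustration only; the cluster itself uses the Hadoop codec classes):

```python
# Demonstrate that gzip compression is a standard, lossless stream format.
import gzip

data = b"hello hdfs compression"
compressed = gzip.compress(data)

assert compressed[:2] == b"\x1f\x8b"        # standard gzip magic number
assert gzip.decompress(compressed) == data  # lossless round trip
```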
Ensure that zlib1.dll is installed in the %HADOOP_HOME%\bin directory on all nodes of the cluster. Download the HDP zlib1.dll from here.
Use the following instructions to configure GzipCodec.
Option I: To use GzipCodec with a one-time job:
On the NameNode host machine, execute the following command as the hdfs user:

hadoop jar hadoop-examples-1.1.0-SNAPSHOT.jar sort "-Dmapred.compress.map.output=true" "-Dmapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec" "-Dmapred.output.compress=true" "-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec" -outKey org.apache.hadoop.io.Text -outValue org.apache.hadoop.io.Text input output
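With output compression enabled, the job's output part files are gzip-compressed. One way to confirm this on a copy fetched from HDFS is to check the two-byte gzip magic number (0x1f 0x8b); the sketch below uses a hypothetical stand-in file name, not a real cluster path:

```python
# Sketch: verify that a locally fetched output file is gzip-compressed
# by inspecting the two-byte gzip magic number (0x1f 0x8b).
# "part-00000.gz" is a hypothetical stand-in for a real output part file.
import gzip

def is_gzip(path):
    with open(path, "rb") as f:
        return f.read(2) == b"\x1f\x8b"

# Create a stand-in output file for the check:
with gzip.open("part-00000.gz", "wb") as f:
    f.write(b"key\tvalue\n")

print(is_gzip("part-00000.gz"))  # True
```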
Option II: To enable GzipCodec as the default compression:
Edit the core-site.xml file on the NameNode host machine:

<property>
    <name>io.compression.codecs</name>
    <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
    <description>A list of the compression codec classes that can be used for compression/decompression.</description>
</property>
Edit the mapred-site.xml file on the JobTracker host machine:

<property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
</property>
<property>
    <name>mapred.map.output.compression.codec</name>
    <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>
<property>
    <name>mapred.output.compression.type</name>
    <value>BLOCK</value>
</property>
[Optional] Set the following two configuration parameters to enable job output compression.
Edit the mapred-site.xml file on the ResourceManager host machine:

<property>
    <name>mapred.output.compress</name>
    <value>true</value>
</property>
<property>
    <name>mapred.output.compression.codec</name>
    <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>
Restart the cluster using the instructions provided here.