Choosing a Data Compression Format

Whether you compress your data, and which compression format you use, can significantly affect performance. The two most important places to consider data compression are MapReduce jobs and data stored in HBase. For the most part, the principles are similar for each.

Continue reading:

General Guidelines
Configuring Data Compression Using Cloudera Manager
Configuring Data Compression Using the Command Line
Further Reading

General Guidelines

You need to balance the processing capacity required to compress and decompress the data, the disk I/O required to read and write the data, and the network bandwidth required to send the data across the network. The correct balance of these factors depends on the characteristics of your cluster and your data, as well as your usage patterns.
Compression is not recommended if your data is already compressed (such as images in JPEG format). The resulting file can actually be larger than the original.
GZip compression uses more CPU resources than Snappy or LZO, but provides a higher compression ratio. GZip is often a good choice for cold data, which is accessed infrequently. Snappy and LZO are better choices for hot data, which is accessed frequently.
BZip2 can produce a higher compression ratio than GZip for some types of files, at the cost of slower compression and decompression. HBase does not support BZip2 compression.
Snappy often performs better than LZO. Run tests on your own data to see whether the difference is significant.
For MapReduce, if you need your compressed data to be splittable, use BZip2 or LZO, both of which can be split. Snappy and GZip blocks are not splittable, but files with Snappy blocks inside a container file format such as SequenceFile or Avro can be split. For this reason, Snappy is intended to be used with a container format rather than directly on plain text, for example, which would not be splittable and could not be processed in parallel using MapReduce. Splittability is not relevant to HBase data.
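
The guidelines above can be checked on a small scale before you commit to a format. The following is a minimal sketch, not a benchmark: it assumes only that the gzip command-line tool is available, uses gzip levels -1 and -9 as a stand-in for the general trade-off between CPU cost and compression ratio (Snappy and LZO command-line tools may not be installed), and uses random bytes as a stand-in for already-compressed data. The file names and sizes are illustrative.

```shell
#!/bin/sh
# Sketch: measure how well gzip compresses text versus random bytes.
# File names, sizes, and the use of random bytes as a stand-in for
# already-compressed data (such as JPEG) are assumptions for illustration.
workdir=$(mktemp -d)

# Plain numeric text: compressible.
seq 1 200000 > "$workdir/numbers.txt"
# Random bytes: behave like data that has already been compressed.
head -c 1000000 /dev/urandom > "$workdir/random.bin"

# Print the size in bytes of whatever arrives on standard input.
bytes() { wc -c | tr -d ' '; }

text_raw=$(bytes < "$workdir/numbers.txt")
text_fast=$(gzip -1 -c "$workdir/numbers.txt" | bytes)   # less CPU
text_best=$(gzip -9 -c "$workdir/numbers.txt" | bytes)   # more CPU
rand_raw=$(bytes < "$workdir/random.bin")
rand_gz=$(gzip -9 -c "$workdir/random.bin" | bytes)

echo "text:   raw=$text_raw gzip-1=$text_fast gzip-9=$text_best"
echo "random: raw=$rand_raw gzip-9=$rand_gz"

rm -rf "$workdir"
```

On the random file, the gzip output is slightly larger than the input, which illustrates why compressing already-compressed data is not recommended. To compare candidate formats for your cluster, substitute a sample of your own data and time each run.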

For MapReduce, you can compress the intermediate data, the output, or both. Adjust the parameters you provide for the MapReduce job accordingly. The following examples compress both the intermediate data and the output. MR2 is shown first, followed by MR1.

MR2

hadoop jar hadoop-examples-.jar sort "-Dmapreduce.map.output.compress=true" \
      "-Dmapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.GzipCodec" \
      "-Dmapreduce.output.fileoutputformat.compress=true" \
      "-Dmapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec" -outKey \
      org.apache.hadoop.io.Text -outValue org.apache.hadoop.io.Text input output

MR1

hadoop jar hadoop-examples-.jar sort "-Dmapred.compress.map.output=true" \
      "-Dmapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec" \
      "-Dmapred.output.compress=true" \
      "-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec" -outKey \
      org.apache.hadoop.io.Text -outValue org.apache.hadoop.io.Text input output
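
To compress only the intermediate data, set the map output properties and leave the job output uncompressed. The following dry-run sketch only builds and prints such a command; it does not run Hadoop. It assumes the MR2 property names from Hadoop's mapred-default.xml (mapreduce.map.output.compress, mapreduce.map.output.compress.codec, and mapreduce.output.fileoutputformat.compress) and picks org.apache.hadoop.io.compress.SnappyCodec purely as an illustration. The jar name keeps the elided version shown in the examples above.

```shell
#!/bin/sh
# Dry-run sketch: print a sort job that compresses only the intermediate
# map output. Nothing is submitted to a cluster.
# The codec choice is an assumption for illustration.
codec="org.apache.hadoop.io.compress.SnappyCodec"

cmd="hadoop jar hadoop-examples-.jar sort"
# Compress the intermediate (map output) data.
cmd="$cmd -Dmapreduce.map.output.compress=true"
cmd="$cmd -Dmapreduce.map.output.compress.codec=$codec"
# Leave the final job output uncompressed.
cmd="$cmd -Dmapreduce.output.fileoutputformat.compress=false"
cmd="$cmd -outKey org.apache.hadoop.io.Text -outValue org.apache.hadoop.io.Text"
cmd="$cmd input output"

echo "$cmd"
```

Review the printed command, fill in the jar version for your installation, and then run it on a host with a configured Hadoop client.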

Configuring Data Compression Using Cloudera Manager

To configure support for LZO using Cloudera Manager, you must install the GPL Extras package, then configure services to use it. See "Installing GPL Extras" and "Configuring Services to Use the GPL Extras Parcel".

Configuring Data Compression Using the Command Line

To configure support for LZO in CDH, see "Step 5: (Optional) Install LZO" and "Configuring LZO". Snappy support is included in CDH.

To use Snappy in a MapReduce job, see "Using Snappy for MapReduce Compression". For LZO, use the same method but specify the codec com.hadoop.compression.lzo.LzopCodec.
