Snappy Compression

Snappy is supported for all CDH components. How you specify compression depends on the component.

Using Snappy with HBase

If you install Hadoop and HBase from RPM or Debian packages, Snappy requires no HBase configuration.

Using Snappy with Hive or Impala

To enable Snappy compression for Hive output when creating SequenceFile outputs, use the following settings:

SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.type=BLOCK;

For information about configuring Snappy compression for Parquet files with Hive, see Using Parquet Tables in Hive. For information about using Snappy compression for Parquet files with Impala, see Snappy and GZip Compression for Parquet Data Files in the Impala Guide.

Using Snappy with MapReduce

Enabling MapReduce intermediate compression can make jobs run faster without requiring application changes. Only the temporary intermediate files created by Hadoop for the shuffle phase are compressed; the final output may or may not be compressed. Snappy is ideal in this case because it compresses and decompresses very quickly compared to other compression algorithms, such as Gzip. For information about choosing a compression format, see Choosing and Configuring Data Compression.

To enable Snappy for MapReduce intermediate compression for the whole cluster, set the following properties in mapred-site.xml:

  • MRv1
    <property>
      <name>mapred.compress.map.output</name>
      <value>true</value>
    </property>
    <property>
      <name>mapred.map.output.compression.codec</name>
      <value>org.apache.hadoop.io.compress.SnappyCodec</value>
    </property>
  • YARN
    <property>
      <name>mapreduce.map.output.compress</name>
      <value>true</value>
    </property>
    <property>
      <name>mapreduce.map.output.compress.codec</name>
      <value>org.apache.hadoop.io.compress.SnappyCodec</value>
    </property>

You can also set these properties on a per-job basis.
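
As a minimal sketch of the per-job approach, the same YARN properties can be passed as -D generic options when submitting a job. This assumes the job's driver parses generic options (for example, through ToolRunner); the JAR name, class name, and paths are hypothetical placeholders, so the command is printed rather than run.

```shell
# Per-job Snappy compression of intermediate (map) output, YARN property names.
SNAPPY_CODEC=org.apache.hadoop.io.compress.SnappyCodec
MAP_OPTS="-Dmapreduce.map.output.compress=true -Dmapreduce.map.output.compress.codec=$SNAPPY_CODEC"

# my-job.jar, com.example.MyJob, and both paths are hypothetical.
# The -D options go after the main class and before the job's own arguments.
# Remove the leading "echo" to submit the job instead of printing the command.
echo hadoop jar my-job.jar com.example.MyJob $MAP_OPTS /user/example/input /user/example/output
```

For MRv1, substitute the mapred.compress.map.output and mapred.map.output.compression.codec property names.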

Use the properties in the following table to compress the final output of a MapReduce job. These are usually set on a per-job basis.

MRv1 Property                    YARN Property                                     Description
mapred.output.compress           mapreduce.output.fileoutputformat.compress        Whether to compress the final job outputs (true or false).
mapred.output.compression.codec  mapreduce.output.fileoutputformat.compress.codec  If the final job outputs are to be compressed, the codec to use. Set to org.apache.hadoop.io.compress.SnappyCodec for Snappy compression.
mapred.output.compression.type   mapreduce.output.fileoutputformat.compress.type   For SequenceFile outputs, the type of compression to use (NONE, RECORD, or BLOCK). Cloudera recommends BLOCK.
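
The final-output properties can be combined on a per-job basis in the same way. The following is a sketch only, using the YARN property names and assuming a driver that parses -D generic options; the JAR name, class name, and paths are hypothetical, so the command is printed rather than run.

```shell
# Per-job Snappy compression of the final job output, YARN property names.
SNAPPY_CODEC=org.apache.hadoop.io.compress.SnappyCodec
OUT_OPTS="-Dmapreduce.output.fileoutputformat.compress=true"
OUT_OPTS="$OUT_OPTS -Dmapreduce.output.fileoutputformat.compress.codec=$SNAPPY_CODEC"
# The compression type applies to SequenceFile outputs; BLOCK is the recommended value.
OUT_OPTS="$OUT_OPTS -Dmapreduce.output.fileoutputformat.compress.type=BLOCK"

# my-job.jar, com.example.MyJob, and both paths are hypothetical.
# Remove the leading "echo" to submit the job instead of printing the command.
echo hadoop jar my-job.jar com.example.MyJob $OUT_OPTS /user/example/input /user/example/output
```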

Using Snappy with Pig

Set the same properties for Pig as for MapReduce.
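
One way to do this for a single Pig run is to pass the properties as -D options on the pig command line. This is a sketch under the assumption that your Pig version forwards -D properties to Hadoop; the script name example.pig is hypothetical, so the command is printed rather than run.

```shell
# Snappy compression of intermediate (map) output for one Pig run, YARN property names.
SNAPPY_CODEC=org.apache.hadoop.io.compress.SnappyCodec
PIG_SNAPPY_OPTS="-Dmapreduce.map.output.compress=true -Dmapreduce.map.output.compress.codec=$SNAPPY_CODEC"

# example.pig is a hypothetical script name; the -D options precede it.
# Remove the leading "echo" to run the script instead of printing the command.
echo pig $PIG_SNAPPY_OPTS example.pig
```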

Using Snappy with Spark SQL

To enable Snappy compression for Parquet tables written by Spark SQL, set the spark.sql.parquet.compression.codec configuration to snappy:
sqlContext.setConf("spark.sql.parquet.compression.codec","snappy") 

Using Snappy Compression with Sqoop 1 Imports

  • Sqoop 1 - On the command line, use the following option to enable Snappy compression:
    --compression-codec org.apache.hadoop.io.compress.SnappyCodec

    Cloudera recommends using the --as-sequencefile option with this compression option.
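
Put together, a complete import command might look like the following sketch. The JDBC URL, user name, table, and target directory are hypothetical placeholders, and --compress is included explicitly as an assumption (the text above names only --compression-codec); the command is printed rather than run.

```shell
# Snappy-compressed SequenceFile import with Sqoop 1.
SNAPPY_CODEC=org.apache.hadoop.io.compress.SnappyCodec
SNAPPY_IMPORT_OPTS="--as-sequencefile --compress --compression-codec $SNAPPY_CODEC"

# The connection string, user, table, and target directory are hypothetical.
# -P prompts for the database password at run time.
# Remove the leading "echo" to run the import instead of printing the command.
echo sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username example_user -P \
  --table orders \
  --target-dir /user/example/orders \
  $SNAPPY_IMPORT_OPTS
```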