Using Apache Avro Data Files with CDH

Apache Avro is a serialization system. Avro supports rich data structures, a compact binary encoding, and a container file for sequences of Avro data (often referred to as Avro data files). Avro is language-independent and there are several language bindings for it, including Java, C, C++, Python, and Ruby.

Avro data files have the .avro extension. Make sure the files you create have this extension, because some tools use it to determine which files to process as Avro (for example, AvroInputFormat and AvroAsTextInputFormat for MapReduce and streaming).

Avro does not rely on generated code, so processing data imported from Flume or Sqoop 1 is simpler than using Hadoop Writables in SequenceFiles, where you must ensure that the generated classes are on the processing job classpath. Pig and Hive cannot easily process SequenceFiles with custom Writables, so users often resort to text, which is less compact and compresses less well. Compressed text is also generally not splittable, which makes it difficult to process efficiently with MapReduce.

All components in CDH that produce or consume files support Avro data files.

Compression for Avro Data Files

By default, Avro data files are not compressed, but Cloudera recommends enabling compression to reduce disk usage and increase read and write performance. Avro data files support Deflate and Snappy compression. Snappy is faster, but Deflate produces slightly smaller files.
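As a rough illustration of the space savings, the following Python sketch compresses a repetitive block of bytes with raw Deflate and restores it. It assumes, per the Avro specification, that the deflate codec uses raw RFC 1951 streams without a zlib header. The helper names and the sample data are hypothetical, and Snappy is omitted because it is not in the Python standard library.

```python
import zlib


def deflate_block(data: bytes, level: int = 6) -> bytes:
    """Compress one block with raw Deflate (negative wbits = no zlib header)."""
    compressor = zlib.compressobj(level, zlib.DEFLATED, -15)
    return compressor.compress(data) + compressor.flush()


def inflate_block(data: bytes) -> bytes:
    """Reverse deflate_block; the same negative wbits selects raw Deflate."""
    return zlib.decompress(data, -15)


if __name__ == "__main__":
    # Hypothetical, highly repetitive payload standing in for serialized records.
    block = b"number,first_name,last_name\n" * 200
    packed = deflate_block(block)
    print(len(block), "bytes before,", len(packed), "bytes after")
```

Real savings depend on the data; highly repetitive records compress far better than already-compact binary values.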

You do not need to specify any configuration to read a compressed Avro data file. However, to write a compressed Avro data file, you must specify the type of compression. How you specify compression depends on the component.
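Readers need no configuration because an Avro data file records its codec in the file header, under the metadata key avro.codec. The following Python sketch parses just enough of the header to recover that value. It follows the container file layout in the Avro specification (magic bytes, then a metadata map, then a sync marker); the function names are hypothetical, and real applications would normally use an Avro library instead.

```python
import io

MAGIC = b"Obj\x01"  # four-byte marker at the start of every Avro data file


def read_long(stream):
    """Decode one Avro long: variable-length bytes, zigzag encoded."""
    shift = 0
    accum = 0
    while True:
        chunk = stream.read(1)
        if not chunk:
            raise EOFError("truncated Avro header")
        byte = chunk[0]
        accum |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    return (accum >> 1) ^ -(accum & 1)


def read_bytes(stream):
    """Decode a length-prefixed byte sequence (also used for strings)."""
    length = read_long(stream)
    data = stream.read(length)
    if len(data) != length:
        raise EOFError("truncated Avro header")
    return data


def read_codec(stream):
    """Return the codec name stored in an Avro data file header."""
    if stream.read(4) != MAGIC:
        raise ValueError("not an Avro data file")
    metadata = {}
    while True:
        count = read_long(stream)
        if count == 0:  # a zero count terminates the map
            break
        if count < 0:  # negative count: a block byte size follows
            read_long(stream)
            count = -count
        for _ in range(count):
            key = read_bytes(stream).decode("utf-8")
            metadata[key] = read_bytes(stream)
    # A missing avro.codec entry means the "null" (uncompressed) codec.
    return metadata.get("avro.codec", b"null").decode("utf-8")


if __name__ == "__main__":
    # Hand-built header: one metadata entry, then a 16-byte sync marker.
    header = MAGIC + b"\x02" + b"\x14avro.codec" + b"\x0csnappy" + b"\x00" + bytes(16)
    print(read_codec(io.BytesIO(header)))
```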

Using Avro Data Files in Flume

The HDFSEventSink, which serializes event data onto HDFS, supports plug-in implementations of the EventSerializer interface. Implementations of this interface have full control over the serialization format and can be used in cases where the default serialization format provided by the sink is insufficient.

An abstract implementation of the EventSerializer interface, called AbstractAvroEventSerializer, is provided with Flume. This class can be extended to support custom schemas for Avro serialization over HDFS. The FlumeEventAvroEventSerializer class provides a simple implementation that maps the events to a representation of a String header map and byte payload in Avro. Use this class by setting the serializer property of the sink as follows:

agent-name.sinks.sink-name.serializer = AVRO_EVENT
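To make the "String header map and byte payload" representation concrete, the following Python sketch hand-encodes one event in Avro binary form. It assumes the record has exactly two fields, a headers map with string values followed by a bytes body; that layout is an assumption based on the description above, not a copy of the Flume source. The function names and the sample event are hypothetical.

```python
def encode_long(value: int) -> bytes:
    """Encode an Avro long: zigzag, then variable-length 7-bit groups."""
    zigzag = (value << 1) ^ (value >> 63)
    out = bytearray()
    while zigzag & ~0x7F:
        out.append((zigzag & 0x7F) | 0x80)
        zigzag >>= 7
    out.append(zigzag)
    return bytes(out)


def encode_bytes(data: bytes) -> bytes:
    """Avro bytes: a long length followed by the raw bytes."""
    return encode_long(len(data)) + data


def encode_string(text: str) -> bytes:
    """Avro string: a long length followed by UTF-8 bytes."""
    return encode_bytes(text.encode("utf-8"))


def encode_event(headers: dict, body: bytes) -> bytes:
    """Encode one event as (map of string -> string, bytes), in field order."""
    out = bytearray()
    if headers:
        out += encode_long(len(headers))  # one block holding every entry
        for key, value in headers.items():
            out += encode_string(key) + encode_string(value)
    out += encode_long(0)  # a zero count terminates the map
    out += encode_bytes(body)
    return bytes(out)


if __name__ == "__main__":
    # Hypothetical event: one header and a two-byte payload.
    print(encode_event({"host": "a"}, b"hi"))
```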

Using Avro Data Files in Hive

The following example demonstrates how to create a Hive table backed by Avro data files:

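CREATE TABLE doctors
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.literal'='{
  "namespace": "testing.hive.avro.serde",
  "name": "doctors",
  "type": "record",
  "fields": [
    {"name":"number", "type":"int", "doc":"Order of playing the role"},
    {"name":"first_name", "type":"string", "doc":"first name of actor playing role"},
    {"name":"last_name", "type":"string", "doc":"last name of actor playing role"},
    {"name":"extra_field", "type":"string", "doc":"an extra field not in the original file", "default":"fishfingers and custard"}
  ]
}');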

LOAD DATA LOCAL INPATH '/usr/share/doc/hive-0.7.1+42.55/examples/files/doctors.avro' INTO TABLE doctors;
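The last field in the schema literal carries a default because, as its doc string says, it is not in the original file. When the reading schema has a field that the file lacks, Avro fills in the default. The following Python sketch imitates that rule for a single record. It is a simplification of Avro schema resolution, and the field names and sample record are illustrative assumptions rather than the contents of doctors.avro.

```python
# Reader-side fields; only the last one declares a default.
READER_FIELDS = [
    {"name": "number", "type": "int"},
    {"name": "first_name", "type": "string"},
    {"name": "last_name", "type": "string"},
    {"name": "extra_field", "type": "string", "default": "fishfingers and custard"},
]


def resolve(record: dict, reader_fields: list) -> dict:
    """Project a stored record onto the reader fields, applying defaults."""
    resolved = {}
    for field in reader_fields:
        name = field["name"]
        if name in record:
            resolved[name] = record[name]
        elif "default" in field:
            resolved[name] = field["default"]
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return resolved


if __name__ == "__main__":
    # Hypothetical record as it might be stored, without extra_field.
    stored = {"number": 1, "first_name": "Jane", "last_name": "Doe"}
    print(resolve(stored, READER_FIELDS))
```

A field without a default that is also missing from the file is an error, which is why fields added after the data was written need a default.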

You can also create an Avro-backed Hive table by using an Avro schema file:

CREATE TABLE my_avro_table(notused INT)

avro.schema.url is a URL (here a file:// URL) pointing to an Avro schema file used for reading and writing. It could also be an hdfs: URL; for example, hdfs://hadoop-namenode-uri/examplefile.

To enable Snappy compression on output files, run the following before writing to the table:

SET hive.exec.compress.output=true;
SET avro.output.codec=snappy;

Also include the snappy-java JAR in --auxpath. The JAR is located at /usr/lib/hive/lib/snappy-java- or /opt/cloudera/parcels/CDH/lib/hive/lib/snappy-java-

The Haivvreo SerDe has been merged into Hive as AvroSerDe and is no longer supported in its original form. As a result of the merge, the schema.url and schema.literal properties have been renamed to avro.schema.url and avro.schema.literal. Tables created with the Haivvreo SerDe can use the Hive AvroSerDe. For example, if you have a table my_avro_table that uses the Haivvreo SerDe, run the following statement to make the table use the new AvroSerDe:

ALTER TABLE my_avro_table SET SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe';


Using Avro Data Files in MapReduce

The Avro MapReduce API is an Avro module for running MapReduce programs that produce or consume Avro data files.

If you are using Maven, add the avro-mapred artifact (group ID org.apache.avro) as a dependency in your POM, using the version that matches your CDH release.

Then write your program, using the Avro MapReduce javadoc for guidance.

At run time, include the avro and avro-mapred JARs in the HADOOP_CLASSPATH and the avro, avro-mapred and paranamer JARs in -libjars.

To enable Snappy compression on output, call AvroJob.setOutputCodec(job, "snappy") when configuring the job. You must also include the snappy-java JAR in -libjars.
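The two JAR lists above use different separators: HADOOP_CLASSPATH is a colon-separated path, while -libjars takes a comma-separated list. The following Python sketch assembles both for a job launch. The JAR file names, directory, job JAR, and main class are hypothetical, and the sketch assumes the job's main class parses generic options (for example through ToolRunner), so that -libjars is honored.

```python
def build_job_invocation(job_jar, main_class, jar_dir, job_args, snappy=False):
    """Return (extra_env, argv) for launching an Avro MapReduce job."""
    avro = f"{jar_dir}/avro.jar"
    avro_mapred = f"{jar_dir}/avro-mapred.jar"
    paranamer = f"{jar_dir}/paranamer.jar"

    # Client-side classpath: avro and avro-mapred, colon separated.
    env = {"HADOOP_CLASSPATH": ":".join([avro, avro_mapred])}

    # JARs shipped to the tasks: comma separated, no spaces.
    libjars = [avro, avro_mapred, paranamer]
    if snappy:
        libjars.append(f"{jar_dir}/snappy-java.jar")

    # Generic options such as -libjars go after the main class
    # and before the job's own arguments.
    argv = ["hadoop", "jar", job_jar, main_class, "-libjars", ",".join(libjars)]
    argv.extend(job_args)
    return env, argv


if __name__ == "__main__":
    env, argv = build_job_invocation(
        "my-job.jar", "com.example.MyAvroJob", "/opt/jars",
        ["input", "output"], snappy=True,
    )
    print(env["HADOOP_CLASSPATH"])
    print(" ".join(argv))
```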

Using Avro Data Files in Pig

CDH provides AvroStorage for Avro integration in Pig.

To use it, first register the piggybank JAR file and supporting libraries:

REGISTER piggybank.jar
REGISTER lib/avro-1.7.3.jar
REGISTER lib/json-simple-1.1.jar
REGISTER lib/snappy-java-

Then load Avro data files as follows:

a = LOAD 'my_file.avro' USING org.apache.pig.piggybank.storage.avro.AvroStorage();

Pig maps the Avro schema to a corresponding Pig schema.

You can store data in Avro data files with:

store b into 'output' USING org.apache.pig.piggybank.storage.avro.AvroStorage();

With store, Pig generates an Avro schema from the Pig schema. You can override the Avro schema by specifying it literally as a parameter to AvroStorage or by using the same schema as an existing Avro data file. See the Pig wiki for details.

To store two relations in one script, specify an index to each store function. For example:

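set1 = load 'input1.txt' using PigStorage() as ( ... );
store set1 into 'set1' using org.apache.pig.piggybank.storage.avro.AvroStorage('index', '1');

set2 = load 'input2.txt' using PigStorage() as ( ... );
store set2 into 'set2' using org.apache.pig.piggybank.storage.avro.AvroStorage('index', '2');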

For more information, search for "index" in the AvroStorage wiki.

To enable Snappy compression on output files, do the following before issuing the STORE statement:

SET mapred.output.compress true
SET mapred.output.compression.codec org.apache.hadoop.io.compress.SnappyCodec
SET avro.output.codec snappy

For more information, see the Pig wiki. The version numbers of the JAR files to register are different on that page, so adjust them as shown above.

Importing Avro Data Files in Sqoop 1

On the command line, use the following option to import data as Avro data files:

--as-avrodatafile

Sqoop 1 automatically generates an Avro schema that corresponds to the database table being imported.

To enable Snappy compression, add the following option:

--compression-codec snappy
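The following Python sketch shows where these options fit in a complete import command by assembling the argument list for a Sqoop 1 import that writes Avro data files. The JDBC URL, table name, and target directory are hypothetical placeholders, and the helper function is an illustration rather than part of Sqoop.

```python
import shlex


def sqoop_import_command(jdbc_url, table, target_dir, codec=None):
    """Build the argv for a Sqoop 1 import that writes Avro data files."""
    args = [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--table", table,
        "--target-dir", target_dir,
        "--as-avrodatafile",  # write Avro data files instead of text
    ]
    if codec:
        args += ["--compression-codec", codec]
    return args


if __name__ == "__main__":
    # Hypothetical connection details, for illustration only.
    command = sqoop_import_command(
        "jdbc:mysql://db.example.com/shop", "orders", "/data/orders",
        codec="snappy",
    )
    print(" ".join(shlex.quote(part) for part in command))
```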

Using Avro Data Files in Impala

Impala can query Avro files, but currently cannot write them. When you use Avro files with Impala, you typically create the data files using Hive or Spark, and then use Impala for analytic queries. See Using the Avro File Format with Impala Tables for details.

For new data pipelines, where you do not already have existing data in Avro format, consider using Parquet data files. Parquet files are optimized for the kinds of data warehouse-style queries typically done in Impala. See Using the Parquet File Format with Impala Tables for details.

Using Avro Data Files in Spark

Using Avro Data Files in Streaming Programs

To read Avro data files from a streaming program, specify org.apache.avro.mapred.AvroAsTextInputFormat as the input format. This format converts each datum in the Avro data file to a string. For a "bytes" schema, this is the raw bytes; in all other cases, it is a single-line JSON representation.

To write Avro data files from a streaming program, specify org.apache.avro.mapred.AvroTextOutputFormat as the output format. This format creates Avro data files with a "bytes" schema, where each datum is a tab-delimited key-value pair.

At run time, specify the avro, avro-mapred, and paranamer JARs in -libjars in the streaming command.

To enable Snappy compression on output files, set the property avro.output.codec to snappy. You must also include the snappy-java JAR in -libjars.
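As a minimal sketch of a streaming program that fits the formats described above, the following Python mapper reads one single-line JSON record per input line and emits a tab-delimited key-value pair. The key field name last_name is a hypothetical example; substitute a field from your own schema.

```python
import json
import sys


def map_line(line: str, key_field: str) -> str:
    """Turn one JSON record into a tab-delimited key-value line."""
    record = json.loads(line)
    key = record[key_field]
    # Re-serialize on a single line so the value never contains a newline.
    return f"{key}\t{json.dumps(record, sort_keys=True)}"


def main(stdin=sys.stdin, stdout=sys.stdout):
    for line in stdin:
        line = line.strip()  # drop the newline and any trailing separator
        if line:
            stdout.write(map_line(line, "last_name") + "\n")


if __name__ == "__main__":
    main()
```

Keeping the record-level logic in map_line makes it possible to test the mapper locally by piping a few JSON lines into the script before running it on a cluster.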