Using Avro with Pig

CDH provides AvroStorage for Avro integration in Pig.

To use it, first register the piggybank JAR file and supporting libraries:

REGISTER piggybank.jar
REGISTER lib/avro-1.7.3.jar
REGISTER lib/json-simple-1.1.jar
REGISTER lib/snappy-java-<version>.jar

Then you can load Avro data files as follows:

a = LOAD 'my_file.avro' USING org.apache.pig.piggybank.storage.avro.AvroStorage();

Pig maps the Avro schema to a corresponding Pig schema.
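For example, an Avro record whose fields are a string and an int is exposed to Pig as a tuple of a chararray and an int. A minimal sketch (the file name, schema, and field names here are illustrative, not from the original):

```pig
-- Suppose users.avro was written with this Avro schema:
--   {"type": "record", "name": "User",
--    "fields": [{"name": "name", "type": "string"},
--               {"name": "age",  "type": "int"}]}
a = LOAD 'users.avro' USING org.apache.pig.piggybank.storage.avro.AvroStorage();

-- DESCRIBE shows the Pig schema derived from the Avro schema,
-- roughly: a: {name: chararray, age: int}
DESCRIBE a;
```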

You can store data in Avro data files with:

store b into 'output' USING org.apache.pig.piggybank.storage.avro.AvroStorage();

In the case of store, Pig generates an Avro schema from the Pig schema. It is possible to override the Avro schema, either by specifying it literally as a parameter to AvroStorage, or by using the same schema as an existing Avro data file. See the Pig wiki for details.
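As a sketch of both override styles, assuming the 'schema' and 'same' parameter names documented on the AvroStorage wiki (the output paths and the schema itself are illustrative):

```pig
-- Specify the Avro schema literally as a parameter:
store b into 'output1' using org.apache.pig.piggybank.storage.avro.AvroStorage(
    'schema', '{"type": "record", "name": "Rec",
                "fields": [{"name": "x", "type": "int"}]}');

-- Or reuse the schema of an existing Avro data file:
store b into 'output2' using org.apache.pig.piggybank.storage.avro.AvroStorage(
    'same', 'existing_file.avro');
```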

To store two relations in one script, specify an index for each store function. Here is an example:

set1 = load 'input1.txt' using PigStorage() as ( ... );
store set1 into 'set1' using org.apache.pig.piggybank.storage.avro.AvroStorage('index', '1');

set2 = load 'input2.txt' using PigStorage() as ( ... );
store set2 into 'set2' using org.apache.pig.piggybank.storage.avro.AvroStorage('index', '2');

For more information, see the AvroStorage wiki; look for "index".

To enable Snappy compression on output files, do the following before issuing the STORE statement:

SET mapred.output.compress true
SET mapred.output.compression.codec org.apache.hadoop.io.compress.SnappyCodec
SET avro.output.codec snappy
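Putting the pieces together, a complete script that reads an Avro file and writes Snappy-compressed Avro output might look like this (the input and output paths are illustrative):

```pig
REGISTER piggybank.jar
REGISTER lib/avro-1.7.3.jar
REGISTER lib/json-simple-1.1.jar
-- plus the snappy-java JAR; its version depends on your CDH release

SET mapred.output.compress true
SET mapred.output.compression.codec org.apache.hadoop.io.compress.SnappyCodec
SET avro.output.codec snappy

a = LOAD 'my_file.avro' USING org.apache.pig.piggybank.storage.avro.AvroStorage();
STORE a INTO 'output' USING org.apache.pig.piggybank.storage.avro.AvroStorage();
```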

There is some additional documentation on the Pig wiki. Note, however, that the version numbers of the JAR files to register are different on that page, so you should adjust them as shown above.