kite-morphlines-hadoop-parquet-avro

This Maven module contains morphline commands for handling Hadoop Avro Parquet files.

readAvroParquetFile

The readAvroParquetFile command (source code) parses a Hadoop Parquet file and emits a morphline record for each contained Avro datum.

The morphline record input field file_upload_url must contain the HDFS path of the Parquet file to read. (MapReduceIndexerTool already provides this field out of the box.)

For each Avro datum, the command emits a morphline record containing the datum as an attachment in the field _attachment_body. Typically, the emitted Avro datum is further post-processed with downstream commands such as extractAvroPaths.
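As a minimal sketch of this chaining, the morphline below reads the Parquet file and then pulls two values out of each emitted Avro datum with extractAvroPaths. The morphline id and the field names id and text are hypothetical placeholders; substitute the fields of your own Avro schema.

```
morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.**"]

    commands : [
      # Emits one record per Avro datum, with the datum in _attachment_body
      { readAvroParquetFile {} }

      # Copies values from the Avro datum into morphline record fields.
      # The paths /id and /text are assumptions about the file's schema.
      {
        extractAvroPaths {
          flatten : true
          paths : {
            id : /id
            text : /text
          }
        }
      }

      # Logs each resulting record for inspection
      { logDebug { format : "output record: {}", args : ["@{}"] } }
    ]
  }
]
```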

Optionally, an Avro schema to use for projecting Parquet columns can be supplied with a configuration option.
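For illustration, an inline projection schema might look like the sketch below. It assumes the Parquet file holds records with fields named timestamp and host; the record name and both field names are hypothetical and must be adapted to the actual file. Only the columns listed in the schema are requested from the file.

```
readAvroParquetFile {
  # Hypothetical projection: read only the timestamp and host columns
  projectionSchemaString : """
    {
      "type" : "record",
      "name" : "SyslogProjection",
      "fields" : [
        { "name" : "timestamp", "type" : "long" },
        { "name" : "host", "type" : "string" }
      ]
    }
  """
}
```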

The command provides the following configuration options:

Property Name | Default | Description
decimalConversionEnabled | false | When set to true, decimal Parquet data is read correctly instead of being returned as raw bytes.
projectionSchemaFile | null | An optional Avro schema file in JSON format on the local file system to use for projection. This Avro schema is converted to a Parquet schema before the projection is applied.
projectionSchemaString | null | An optional Avro schema in JSON format given inline to use for projection. This Avro schema is converted to a Parquet schema before the projection is applied.
readerSchemaFile | null | Available in CDH 5.0 and beyond. An optional Avro schema file in JSON format on the local file system to use for reading.
readerSchemaString | null | Available in CDH 5.0 and beyond. An optional Avro schema in JSON format given inline to use for reading. Behaves identically to the readerSchemaFile parameter described above.

Example usage:

readAvroParquetFile {
  # Optionally, use this Avro schema in JSON format inline for projection:
  # projectionSchemaString : """<json can go here>"""

  # Optionally, use this Avro schema file in JSON format for projection:
  # projectionSchemaFile : /path/to/syslog.avsc
}
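
The remaining options from the table can be set in the same way. The following sketch enables decimal conversion and supplies a reader schema from a file; the path /path/to/reader.avsc is a hypothetical placeholder, and the reader schema options assume CDH 5.0 or later.

```
readAvroParquetFile {
  # Read decimal Parquet data as decimals instead of raw bytes
  decimalConversionEnabled : true

  # Hypothetical path to an Avro schema file in JSON format used for reading
  readerSchemaFile : /path/to/reader.avsc
}
```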