Cloudera Search Morphlines Reference

kite-morphlines-hadoop-parquet-avro

This Maven module contains morphline commands for handling Hadoop Avro Parquet files.

The readAvroParquetFile command parses a Hadoop Parquet file and emits a morphline record for each contained Avro datum.

The file_upload_url field of the input morphline record must contain the HDFS path of the Parquet file to read. (MapReduceIndexerTool already provides this field out of the box.)

For each Avro datum, the command emits a morphline record containing the datum as an attachment in the field _attachment_body. Typically, the emitted Avro datum is further post-processed with downstream commands such as extractAvroPaths.
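As a minimal sketch of the flow described above, the following morphline reads the Parquet file and passes each emitted datum to extractAvroPaths. The morphline id and the Avro paths /id and /text are hypothetical placeholders; replace them with fields that exist in your own schema.

```
morphlines : [
  {
    # Hypothetical morphline name
    id : morphline1
    importCommands : ["org.kitesdk.**"]

    commands : [
      # Emits one record per Avro datum, stored in _attachment_body
      { readAvroParquetFile {} }

      # Copies values from the Avro datum into record fields.
      # The paths /id and /text are assumptions for illustration only.
      {
        extractAvroPaths {
          flatten : false
          paths : {
            id : /id
            text : /text
          }
        }
      }

      # Logs each resulting record for inspection
      { logDebug { format : "output record: {}", args : ["@{}"] } }
    ]
  }
]
```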

Optionally, an Avro schema for projecting Parquet columns can be supplied with the projectionSchemaFile or projectionSchemaString configuration option.

In CDH 5.0 and later, an additional Avro reader schema parameter can be specified. Parquet files that were not written with the parquet.avro package (for example, Impala Parquet files) have no Avro write schema stored in the Parquet file metadata. To read such files with the readAvroParquetFile command, either provide an Avro reader schema via the readerSchemaFile or readerSchemaString parameter, or omit both and let the command derive a default Avro schema using the standard mapping specification. Prior to CDH 5.0, the implementation required the Parquet file to contain an explicit Avro schema, for example as written by the parquet.avro.AvroParquetWriter class, and the reader schema was always retrieved from the Parquet file.
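For a Parquet file without an embedded Avro write schema, the reader schema case might look like the following sketch. The schema path is a hypothetical placeholder.

```
readAvroParquetFile {
  # Hypothetical path to an Avro schema file in JSON format on the local file system
  readerSchemaFile : /path/to/reader-schema.avsc
}
```

If readerSchemaFile is left out, the command falls back to the derived default schema described above.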

The command provides the following configuration options:

| Property Name | Default | Description |
| decimalConversionEnabled | false | When set to true, decimal Parquet data is read as decimal values instead of being returned as raw bytes. |
| projectionSchemaFile | null | An optional Avro schema file in JSON format on the local file system to use for projection. This Avro schema is converted to a Parquet schema before the projection is applied. |
| projectionSchemaString | null | An optional Avro schema in JSON format, given inline, to use for projection. This Avro schema is converted to a Parquet schema before the projection is applied. |
| readerSchemaFile | null | Available in CDH 5.0 and later. An optional Avro schema file in JSON format on the local file system to use for reading, as discussed above. |
| readerSchemaString | null | Available in CDH 5.0 and later. An optional Avro schema in JSON format, given inline, to use for reading. Behaves identically to the readerSchemaFile parameter described above. |

Example usage:

readAvroParquetFile {
  # Optionally, use this Avro schema in JSON format inline for projection:
  # projectionSchemaString : """<json can go here>"""

  # Optionally, use this Avro schema file in JSON format for projection:
  # projectionSchemaFile : /path/to/syslog.avsc
}
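A reader schema can be given inline in the same way. The sketch below is illustrative only: the record name Event and its fields are hypothetical, and decimalConversionEnabled is set merely to show where the option goes.

```
readAvroParquetFile {
  # Read decimal Parquet data as decimals rather than raw bytes
  decimalConversionEnabled : true

  # Hypothetical inline Avro reader schema; adjust the record and fields to your data
  readerSchemaString : """
    {
      "type" : "record",
      "name" : "Event",
      "fields" : [
        { "name" : "id", "type" : "long" },
        { "name" : "message", "type" : ["null", "string"], "default" : null }
      ]
    }
  """
}
```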
