kite-morphlines-hadoop-parquet-avro
This maven module contains morphline commands for handling Hadoop Avro Parquet files.
readAvroParquetFile
The readAvroParquetFile
command (source code) parses a Hadoop Parquet file and emits a
morphline record for each contained Avro datum.
The morphline record input field file_upload_url must contain the HDFS Path of the Parquet
file to read. (This field is already provided out of the box with
MapReduceIndexerTool
).
For each Avro datum, the command emits a morphline record containing the datum as an attachment in the field _attachment_body. Typically, the emitted Avro datum is further post-processed with downstream commands such as extractAvroPaths.
Optionally, an Avro schema that shall be used for projecting parquet columns can be supplied with a configuration option.
The command provides the following configuration options:
Property Name | Default | Description |
---|---|---|
decimalConversionEnabled | false | When set to true , decimal Parquet data is correctly read
instead of returning raw bytes. |
projectionSchemaFile | null | An optional Avro schema file in JSON format on the local file system to use for projection. This Avro schema is converted to a parquet schema before applying the projection. |
projectionSchemaString | null | An optional Avro schema in JSON format given inline to use for projection. This Avro schema is converted to a parquet schema before applying the projection. |
readerSchemaFile | null | This optional parameter is available in CDH 5.0 and beyond. This optional parameter specifies an Avro schema file in JSON format on the local file system to use for reading, as discussed above. |
readerSchemaString | null | This optional parameter is available in CDH 5.0 and beyond. This optional parameter specifies an optional Avro schema in JSON format given inline to use for reading. Has identical behaviour as the readerSchemaFile parameter described above. |
Example usage:
readAvroParquetFile {
# Optionally, use this Avro schema in JSON format inline for projection:
# projectionSchemaString : """<json can go here>"""
# Optionally, use this Avro schema file in JSON format for projection:
# projectionSchemaFile : /path/to/syslog.avsc
}