kite-morphlines-hadoop-parquet-avro
This Maven module contains morphline commands for handling Hadoop Avro Parquet files.
readAvroParquetFile
The readAvroParquetFile command parses a Hadoop Parquet file and emits a morphline record for each contained Avro datum.
The morphline record input field file_upload_url must contain the HDFS path of the Parquet file to read. (This field is already provided out of the box with MapReduceIndexerTool.)
For each Avro datum, the command emits a morphline record containing the datum as an attachment in the field _attachment_body. Typically, the emitted Avro datum is further post-processed with downstream commands such as extractAvroPaths.
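As a minimal sketch of the post-processing described above, the following morphline chains readAvroParquetFile with extractAvroPaths. The morphline id and the Avro field paths /id and /text are hypothetical placeholders and would need to match fields that actually exist in the Parquet file.

```
morphlines : [
  {
    id : morphline1                        # hypothetical morphline name
    importCommands : ["org.kitesdk.**"]
    commands : [
      # Emits one record per Avro datum, stored in the _attachment_body field
      { readAvroParquetFile {} }

      # Copies selected Avro fields into morphline record fields
      {
        extractAvroPaths {
          flatten : false
          paths : {
            id : /id                       # hypothetical Avro field
            text : /text                   # hypothetical Avro field
          }
        }
      }
    ]
  }
]
```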
Optionally, an Avro schema for projecting Parquet columns can be supplied with the projectionSchemaFile or projectionSchemaString configuration option.
In CDH 5.0 and beyond, an additional Avro reader schema parameter can be specified. Parquet files that were not written with the parquet.avro package (e.g. Impala Parquet files) have no Avro write schema stored in the Parquet file metadata. To read such files with the readAvroParquetFile command, either provide an Avro reader schema via the readerSchemaFile parameter, or omit it, in which case a default Avro schema is derived using the standard mapping specification. Prior to CDH 5.0, the implementation required the Parquet file to contain an explicit Avro schema, e.g. as written by the parquet.avro.AvroParquetWriter class, and the reader schema was always retrieved from the Parquet file.
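For a Parquet file that carries no Avro write schema, such as one written by Impala, a configuration along the following lines could supply the reader schema explicitly. This is a sketch only: the schema path is a hypothetical placeholder, and the option requires CDH 5.0 or later.

```
readAvroParquetFile {
  # Avro schema in JSON format on the local file system, used for reading.
  # The path below is a hypothetical placeholder.
  readerSchemaFile : /path/to/reader-schema.avsc
}
```

Leaving readerSchemaFile unset for such a file falls back to the default Avro schema derived from the Parquet schema.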
The command provides the following configuration options:
Property Name | Default | Description |
---|---|---|
decimalConversionEnabled | false | When set to true, decimal Parquet data is correctly read instead of returning raw bytes. |
projectionSchemaFile | null | An optional Avro schema file in JSON format on the local file system to use for projection. This Avro schema is converted to a Parquet schema before the projection is applied. |
projectionSchemaString | null | An optional Avro schema in JSON format given inline to use for projection. This Avro schema is converted to a Parquet schema before the projection is applied. |
readerSchemaFile | null | Available in CDH 5.0 and beyond. An optional Avro schema file in JSON format on the local file system to use for reading, as discussed above. |
readerSchemaString | null | Available in CDH 5.0 and beyond. An optional Avro schema in JSON format given inline to use for reading. Behaves identically to the readerSchemaFile parameter described above. |
Example usage:
    readAvroParquetFile {
      # Optionally, use this Avro schema in JSON format inline for projection:
      # projectionSchemaString : """<json can go here>"""

      # Optionally, use this Avro schema file in JSON format for projection:
      # projectionSchemaFile : /path/to/syslog.avsc
    }
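To make the inline projection option more concrete, here is a sketch with the JSON placeholder filled in. The record name SyslogEvent and the fields timestamp and message are hypothetical and would need to correspond to columns actually present in the Parquet file. decimalConversionEnabled is turned on only to show how the options combine.

```
readAvroParquetFile {
  # Read decimal Parquet data as decimals rather than raw bytes
  decimalConversionEnabled : true

  # Project only two columns; the schema below is a hypothetical example
  projectionSchemaString : """
  {
    "type" : "record",
    "name" : "SyslogEvent",
    "fields" : [
      { "name" : "timestamp", "type" : "long" },
      { "name" : "message", "type" : "string" }
    ]
  }
  """
}
```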