kite-morphlines-hadoop-rcfile

readRCFile

The readRCFile command parses an Apache Hadoop RCFile. An RCFile can be read row-wise or column-wise:

  • Row-wise: One morphline record is emitted for each row in the RCFile. Each record contains a field for every column configured in the columns parameter. For example, for an RCFile with 10 rows and 5 columns, Row-wise mode would emit 10 morphline records.

  • Column-wise: For every row-split (block) in the RCFile, one morphline record is emitted for each column specified in the columns parameter. Each record contains the row values of that column as a list. Records are emitted in the order in which the columns appear in the columns parameter. For example, for an RCFile with 10 rows and 5 columns, Column-wise mode would emit 5 morphline records, each with a list of 10 values, assuming all rows fall into a single row-split.

The InputStream of the RCFile is read from the _attachment_body field of the input record. Optionally, the name of the RCFile is read from the _attachment_name field of the input record. If a name is provided, error messages include it, which helps with debugging and diagnostics.

The command automatically handles compressed RCFiles.
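
As a minimal sketch of where the command sits in a pipeline, the configuration below wraps readRCFile in a complete morphline. It assumes the calling application has already placed the RCFile InputStream in _attachment_body (and, optionally, the file name in _attachment_name) before the record reaches the morphline. The morphline id, the single column mapping, and the trailing logDebug step are illustrative choices, not requirements; the columns parameter is described in the configuration options that follow.

```
morphlines: [
  {
    # Hypothetical id; the caller selects the morphline by this name
    id: readRCFileExample
    # Makes the Kite commands, including readRCFile, available
    importCommands: ["org.kitesdk.**"]
    commands: [
      {
        readRCFile {
          readMode: row
          columns: [
            {
              inputField: 0
              outputField: name
              writableClass: "org.apache.hadoop.io.Text"
            }
          ]
        }
      }
      # Log each emitted record to make the output structure visible
      { logDebug { format: "output record: {}", args: ["@{}"] } }
    ]
  }
]
```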

The command provides the following configuration options:

Property Name | Default | Description
--- | --- | ---
readMode | row | Valid values: row or column. Defines the reading strategy. Affects the structure and number of output records that will be emitted, as discussed above.
includeMetaData | false | Whether or not the RCFile metadata shall be included in the output record.
columns | n/a | A list of column configurations. Since an RCFile does not store meta information about the columns itself, this configuration is necessary to read the RCFile.

The columns configuration has the following configuration options:

Property Name | Default | Description
--- | --- | ---
inputField | n/a | Non-negative integer index of the RCFile column to read.
outputField | n/a | The name of the field to add output values to. The output record will have the value of the column added to this field.
writableClass | n/a | Fully qualified name of a class that implements org.apache.hadoop.io.Writable. Instances of this class are used to read the column bytes, and those instances are added to the outputField. For example: org.apache.hadoop.io.Text, org.apache.hadoop.io.LongWritable, or org.apache.hadoop.hive.serde2.columnar.BytesRefWritable.

Example usage:

readRCFile {
  readMode: row
  includeMetaData: false
  columns: [
    {
      inputField: 0
      outputField: name
      writableClass: "org.apache.hadoop.io.Text"
    }
    {
      inputField: 3
      outputField: age
      writableClass: "org.apache.hadoop.io.LongWritable"
    }
    {
      inputField: 1000
      outputField: photo
      writableClass: "org.apache.hadoop.hive.serde2.columnar.BytesRefWritable"
    }
  ]
}
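
For comparison, a column-wise variant of the example above might look as follows. This is a sketch that reuses the name and age mappings from that example and changes only readMode. Going by the description of Column-wise mode, each row-split would then produce one record carrying a list of values in the name field, followed by one record carrying a list of values in the age field.

```
readRCFile {
  # Emit one record per configured column for each row-split
  readMode: column
  includeMetaData: false
  columns: [
    {
      inputField: 0
      outputField: name
      writableClass: "org.apache.hadoop.io.Text"
    }
    {
      inputField: 3
      outputField: age
      writableClass: "org.apache.hadoop.io.LongWritable"
    }
  ]
}
```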