kite-morphlines-hadoop-rcfile
readRCFile
The readRCFile
command (source code) parses an Apache Hadoop RCFile. An RCFile can be read Row-wise or
Column-wise, as follows:
-
Row-wise: One morphline record is emitted for each row in the RCFile. Each record will contain fields for all the columns configured in the
columns
parameter. For example, with an RCFile with 10 rows and 5 columns, Row-wise mode would emit 10 morphline records. -
Column-wise: For every row-split (block) in the RC File, Emits one record for each column specified in the
columns
parameter. This record will contain a list of row values for that column as a list. The order of columns is as specified in thecolumns
parameter. For example, with an RCFile with 10 rows and 5 columns, Column-wise mode would emit 5 morphline records each with a list of 10 values assuming there are no row-splits.
The InputStream of the RCFile is read from the _attachment_body field of the input record. Optionally, the name of the RCFile is read from the _attachment_name field of the input record. Providing a name for the InputStream will, in case of errors, result in error messages containing said name for better debugging and diagnostics.
The command automatically handles compressed RCFiles.
The command provides the following configuration options:
Property Name | Default | Description |
---|---|---|
readMode | row | Valid values: row or column. Defines the reading strategy. Affects the structure and number of output records that will be emitted, as discussed above. |
includeMetaData | false | Whether or not the RCFile metadata shall be included in the output record. |
columns | n/a | A list of column configurations. Since an RCFile does not store meta information about the columns itself, this configuration is necessary to read the RCFile. |
The columns configuration has the following configuration options:
Property Name | Default | Description |
---|---|---|
inputField | n/a | Non-negative integer index of the RCFile column to read. |
outputField | n/a | The name of the field to add output values to. The output record will have the value of the column added to this field. |
writableClass | n/a | Fully Qualified Class Name of a sub-class of org.apache.hadoop.io.Writable. Instances of this class are used to read the value of the column bytes, and said instances are added to the outputField. For example org.apache.hadoop.io.Text or org.apache.hadoop.io.LongWritable or org.apache.hadoop.hive.serde2.columnar.BytesRefWritable |
Example usage:
readRCFile {
readMode: row
includeMetaData: false
columns: [
{
inputField: 0
outputField: name
writableClass: "org.apache.hadoop.io.Text"
}
{
inputField: 3
outputField: age
writableClass: "org.apache.hadoop.io.LongWritable"
}
{
inputField: 1000
outputField: photo
writableClass: "org.apache.hadoop.hive.serde2.columnar.BytesRefWritable"
}
]
}