kite-morphlines-core-stdio

readBlob

The readBlob command (source code) converts a byte stream to a byte array in main memory. It emits one record for the entire input stream of the first attachment, interpreting the stream as a Binary Large Object (BLOB), i.e. emits a corresponding Java byte array. The BLOB is put as a Java byte array into the _attachment_body output field by default.

The command provides the following configuration options:

Property Name Default Description
supportedMimeTypes null Optionally, require the input record to match one of the MIME types in this list.
outputField _attachment_body Name of the output field where the BLOB will be stored.

Example usage:

readBlob {}

readClob

The readClob command (source code) converts bytes to a string. It emits one record for the entire input stream of the first attachment, interpreting the stream as a Character Large Object (CLOB). The CLOB is put as a string into the message output field by default.

The command provides the following configuration options:

Property Name Default Description
supportedMimeTypes null Optionally, require the input record to match one of the MIME types in this list.
charset null he character encoding to use, for example, UTF-8. If none is specified the charset specified in the _attachment_charset input field is used instead.
outputField message Name of the output field where the CLOB will be stored.

Example usage:

readClob {
  charset : UTF-8
}

readCSV

The readCSV command (source code) extracts zero or more records from the input stream of the first attachment of the record, representing a Comma Separated Values (CSV) file.

For the format see this article.

Some CSV files contain a header line that contains embedded column names. This command does not support reading and using such embedded column names as output field names because this is considered unreliable for production systems. If the first line of the CSV file is a header line, you must set the ignoreFirstLine option to true. You must explicitly define the columns configuration parameter in order to name the output fields.

The command provides the following configuration options:

Property Name Default Description
supportedMimeTypes null Optionally, require the input record to match one of the MIME types in this list.
separator "," The character separating any two fields. Must be a string of length one.
columns n/a The name of the output fields for each input column. An empty string indicates omit this column in the output. If more columns are contained in the input than specified here, those columns are automatically named columnN.
ignoreFirstLine false Whether to ignore the first line. This flag can be used for CSV files that contain a header line.
trim true Whether leading and trailing whitespace shall be removed from the output fields.
addEmptyStrings true Whether or not to add zero length strings to the output fields.
charset null The character encoding to use, for example, UTF-8. If none is specified the charset specified in the _attachment_charset input field is used instead.
quoteChar "" Must be a string of length zero or one. If this parameter is a String containing a single character then a quoted field can span multiple lines in the input stream. To disable quoting and multiline fields set this parameter to the empty string "".
commentPrefix "" Must be a string of length zero or one, for example "#". If this parameter is a String containing a single character then lines starting with that character are ignored as comments. To disable the comment line feature set this parameter to the empty string "".
maxCharactersPerRecord 1000000 Records longer than maxCharactersPerRecord characters are handled according to the policy specified in the onMaxCharactersPerRecord parameter described below.
onMaxCharactersPerRecord throwException Records longer than maxCharactersPerRecord characters are handled according to the policy specified in the onMaxCharactersPerRecord parameter. Must be one of ignoreRecord or throwException. A value of ignoreRecord indicates to ignore such records and continue with the following record (warnings about such events are emitted to the log file). This value is typically used in production. A value of throwException indicates to throw an exception and fail hard in such cases. This value is typically used for testing.

If the parameter quoteChar is a String containing a single character then a quoted field can span multiple lines in the input stream, for example as shown in the following example CSV input containing a single record with three columns:

column0,"Look, new hot tub under redwood tree!
All bubbly!",column2

The above example can be parsed by specifying a double-quote character for the parameter quoteChar, using backslash syntax per the JSON specification, as follows:

readCSV {
  ...
  quoteChar : "\""

If the parameter commentPrefix is a String containing a single character then lines starting with that character are ignored as comments. Example:

#This is a comment line. It is ignored.

Example usage for CSV (Comma Separated Values):

readCSV {
  separator : ","
  columns : [Age,"",Extras,Type]
  ignoreFirstLine : false
  quoteChar : ""
  commentPrefix : ""
  trim : true
  charset : UTF-8
}

Example usage for TSV (Tab Separated Values):

readCSV {
  separator : "\t"
  columns : [Age,"",Extras,Type]
  ignoreFirstLine : false
  quoteChar : ""
  commentPrefix : ""
  trim : true
  charset : UTF-8
}

Example usage for SSV (Space Separated Values):

readCSV {
  separator : " "
  columns : [Age,"",Extras,Type]
  ignoreFirstLine : false
  quoteChar : ""
  commentPrefix : ""
  trim : true
  charset : UTF-8
}

Example usage for Apache Hive (Values separated by non-printable CTRL-A character):

readCSV {
  separator : "\u0001" # non-printable CTRL-A character
  columns : [Age,"",Extras,Type]
  ignoreFirstLine : false
  quoteChar : ""
  commentPrefix : ""
  trim : false
  charset : UTF-8
}

readLine

The readLine command (source code) emits one record per line in the input stream of the first attachment. The line is put as a string into the message output field. Empty lines are ignored.

The command provides the following configuration options:

Property Name Default Description
supportedMimeTypes null Optionally, require the input record to match one of the MIME types in this list.
ignoreFirstLine false Whether to ignore the first line. This flag can be used for CSV files that contain a header line.
commentPrefix "" A character that indicates to ignore this line as a comment for example, "#". To disable the comment line feature set this parameter to the empty string "".
charset null The character encoding to use, for example, UTF-8. If none is specified the charset specified in the _attachment_charset input field is used instead.

Example usage:

readLine {
  ignoreFirstLine : true
  commentPrefix : "#"
  charset : UTF-8
}

readMultiLine

The readMultiLine command (source code) is a multiline log parser that collapses multiple input lines into a single record, based on regular expression pattern matching. It supports regex, what, and negate configuration parameters similar to logstash. The line is put as a string into the message output field.

For example, this can be used to parse log4j with stack traces. Also see https://gist.github.com/smougenot/3182192 and http://logstash.net/docs/1.1.13/filters/multiline.

The input stream or byte array is read from the first attachment of the input record.

The command provides the following configuration options:

Property Name Default Description
supportedMimeTypes null Optionally, require the input record to match one of the MIME types in this list.
regex n/a This parameter should match what you believe to be an indicator that the line is part of a multi-line record.
what previous This parameter must be one of "previous" or "next" and indicates the relation of the regex to the multi-line record.
negate false This parameter can be true or false. If true, a line not matching the regex constitutes a match of the multiline filter and the previous or next action is applied. The reverse is also true.
charset null The character encoding to use, for example, UTF-8. If none is specified the charset specified in the _attachment_charset input field is used instead.

Example usage:

# parse log4j with stack traces
readMultiLine {
  regex : "(^.+Exception: .+)|(^\\s+at .+)|(^\\s+\\.\\.\\. \\d+ more)|(^\\s*Caused by:.+)"
  what : previous
  charset : UTF-8
}

# parse sessions; begin new record when we find a line that starts with "Started session"
readMultiLine {
  regex : "Started session.*"
  what : next
  charset : UTF-8
}