kite-morphlines-core-stdio
readBlob
The readBlob command converts a byte stream to a byte array in main memory. It emits one
record for the entire input stream of the first attachment, interpreting the stream as a
Binary Large Object (BLOB). By default, the BLOB is put as a Java byte array into the
_attachment_body output field.
The command provides the following configuration options:
Property Name | Default | Description |
---|---|---|
supportedMimeTypes | null | Optionally, require the input record to match one of the MIME types in this list. |
outputField | _attachment_body | Name of the output field where the BLOB will be stored. |
Example usage:
readBlob {}
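Both options from the table above can be set together. The following is a minimal sketch; the MIME type and the field name raw_bytes are illustrative assumptions, not values required by the command.

```hocon
readBlob {
  supportedMimeTypes : ["application/pdf"]  # assumed MIME type, for illustration only
  outputField : raw_bytes                   # hypothetical field name; default is _attachment_body
}
```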
readClob
The readClob command converts bytes to a string. It emits one record for the entire input
stream of the first attachment, interpreting the stream as a Character Large Object (CLOB).
By default, the CLOB is put as a string into the message output field.
The command provides the following configuration options:
Property Name | Default | Description |
---|---|---|
supportedMimeTypes | null | Optionally, require the input record to match one of the MIME types in this list. |
charset | null | The character encoding to use, for example, UTF-8. If none is specified the charset specified in the _attachment_charset input field is used instead. |
outputField | message | Name of the output field where the CLOB will be stored. |
Example usage:
readClob {
  charset : UTF-8
}
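As a sketch of the charset fallback described in the table, the fragment below omits charset so that the encoding is taken from the _attachment_charset input field. The MIME type and the field name body are illustrative assumptions.

```hocon
readClob {
  supportedMimeTypes : ["text/plain"]  # assumed MIME type, for illustration only
  outputField : body                   # hypothetical field name; default is message
  # charset is omitted on purpose: the _attachment_charset input field is used instead
}
```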
readCSV
The readCSV command extracts zero or more records from the input stream of the first
attachment of the record, representing a Comma Separated Values (CSV) file.
For the format see this article.
Some CSV files contain a header line with embedded column names. This command does not
support reading and using such embedded column names as output field names because this is
considered unreliable for production systems. If the first line of the CSV file is a header
line, you must set the ignoreFirstLine option to true. You must explicitly define the
columns configuration parameter in order to name the output fields.
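The header-line case described above might look like the following sketch. The input file and the column names Name, Age, and City are hypothetical; the point is that the header is skipped rather than parsed, so the names have to be listed by hand.

```hocon
# Hypothetical input file:
#   Name,Age,City
#   Alice,34,Berlin
readCSV {
  separator : ","
  columns : [Name, Age, City]  # assumed names, repeated by hand from the header
  ignoreFirstLine : true       # skip the embedded header line
  charset : UTF-8
}
```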
The command provides the following configuration options:
Property Name | Default | Description |
---|---|---|
supportedMimeTypes | null | Optionally, require the input record to match one of the MIME types in this list. |
separator | "," | The character separating any two fields. Must be a string of length one. |
columns | n/a | The names of the output fields for each input column. An empty string indicates that the column is omitted from the output. If the input contains more columns than specified here, the additional columns are automatically named columnN. |
ignoreFirstLine | false | Whether to ignore the first line. This flag can be used for CSV files that contain a header line. |
trim | true | Whether leading and trailing whitespace shall be removed from the output fields. |
addEmptyStrings | true | Whether or not to add zero length strings to the output fields. |
charset | null | The character encoding to use, for example, UTF-8. If none is specified the charset specified in the _attachment_charset input field is used instead. |
quoteChar | "" | Must be a string of length zero or one. If this parameter is a String containing a single character then a quoted field can span multiple lines in the input stream. To disable quoting and multiline fields set this parameter to the empty string "". |
commentPrefix | "" | Must be a string of length zero or one, for example "#". If this parameter is a String containing a single character then lines starting with that character are ignored as comments. To disable the comment line feature set this parameter to the empty string "". |
maxCharactersPerRecord | 1000000 | Records longer than maxCharactersPerRecord characters are handled according to the policy specified in the onMaxCharactersPerRecord parameter described below. |
onMaxCharactersPerRecord | throwException | The policy applied to records longer than maxCharactersPerRecord characters. Must be one of ignoreRecord or throwException. A value of ignoreRecord ignores such records and continues with the following record (warnings about such events are emitted to the log file). This value is typically used in production. A value of throwException throws an exception and fails hard in such cases. This value is typically used for testing. |
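The two record-length options in the table work as a pair. The sketch below lowers the limit and skips oversized records instead of failing, which matches the setting the table describes as typical for production. The limit of 100000 and the column names are illustrative assumptions.

```hocon
readCSV {
  separator : ","
  columns : [id, payload]                  # hypothetical column names
  maxCharactersPerRecord : 100000          # assumed limit; default is 1000000
  onMaxCharactersPerRecord : ignoreRecord  # skip oversized records and log a warning
  charset : UTF-8
}
```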
If the parameter quoteChar is a String containing a single character then a quoted field
can span multiple lines in the input stream, as shown in the following example CSV input
containing a single record with three columns:
column0,"Look, new hot tub under redwood tree!
All bubbly!",column2
The above example can be parsed by specifying a double-quote character for the parameter
quoteChar, using backslash syntax per the JSON specification, as follows:
readCSV {
  ...
  quoteChar : "\""
}
If the parameter commentPrefix is a String containing a single character then lines
starting with that character are ignored as comments. Example:
#This is a comment line. It is ignored.
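A configuration that skips comment lines like the one above could be sketched as follows. The column names are reused from the usage examples in this section and are only illustrative.

```hocon
readCSV {
  separator : ","
  columns : [Age, "", Extras, Type]
  commentPrefix : "#"  # input lines starting with this character are ignored
  charset : UTF-8
}
```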
Example usage for CSV (Comma Separated Values):
readCSV {
  separator : ","
  columns : [Age,"",Extras,Type]
  ignoreFirstLine : false
  quoteChar : ""
  commentPrefix : ""
  trim : true
  charset : UTF-8
}
Example usage for TSV (Tab Separated Values):
readCSV {
  separator : "\t"
  columns : [Age,"",Extras,Type]
  ignoreFirstLine : false
  quoteChar : ""
  commentPrefix : ""
  trim : true
  charset : UTF-8
}
Example usage for SSV (Space Separated Values):
readCSV {
  separator : " "
  columns : [Age,"",Extras,Type]
  ignoreFirstLine : false
  quoteChar : ""
  commentPrefix : ""
  trim : true
  charset : UTF-8
}
Example usage for Apache Hive (Values separated by non-printable CTRL-A character):
readCSV {
  separator : "\u0001" # non-printable CTRL-A character
  columns : [Age,"",Extras,Type]
  ignoreFirstLine : false
  quoteChar : ""
  commentPrefix : ""
  trim : false
  charset : UTF-8
}
readLine
The readLine command emits one record per line in the input stream of the first attachment.
The line is put as a string into the message output field. Empty lines are ignored.
The command provides the following configuration options:
Property Name | Default | Description |
---|---|---|
supportedMimeTypes | null | Optionally, require the input record to match one of the MIME types in this list. |
ignoreFirstLine | false | Whether to ignore the first line. This flag can be used for CSV files that contain a header line. |
commentPrefix | "" | A character that marks a line as a comment to be ignored, for example "#". To disable the comment line feature set this parameter to the empty string "". |
charset | null | The character encoding to use, for example, UTF-8. If none is specified the charset specified in the _attachment_charset input field is used instead. |
Example usage:
readLine {
  ignoreFirstLine : true
  commentPrefix : "#"
  charset : UTF-8
}
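For context, the sketch below places the same readLine command inside a complete morphline definition, assuming the standard morphline configuration file layout. The identifier morphline1 is hypothetical, and the logDebug command is added only to make the emitted records visible.

```hocon
morphlines : [
  {
    id : morphline1                      # hypothetical identifier
    importCommands : ["org.kitesdk.**"]  # make the Kite commands available
    commands : [
      {
        readLine {
          ignoreFirstLine : true
          commentPrefix : "#"
          charset : UTF-8
        }
      }
      # Log each emitted record so the message field can be inspected
      { logDebug { format : "output record: {}", args : ["@{}"] } }
    ]
  }
]
```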
readMultiLine
The readMultiLine command is a multiline log parser that collapses multiple input lines
into a single record, based on regular expression pattern matching. It supports regex,
what, and negate configuration parameters similar to logstash. The collapsed lines are put
as a string into the message output field.
For example, this can be used to parse log4j with stack traces. Also see https://gist.github.com/smougenot/3182192 and http://logstash.net/docs/1.1.13/filters/multiline.
The input stream or byte array is read from the first attachment of the input record.
The command provides the following configuration options:
Property Name | Default | Description |
---|---|---|
supportedMimeTypes | null | Optionally, require the input record to match one of the MIME types in this list. |
regex | n/a | This parameter should match what you believe to be an indicator that the line is part of a multi-line record. |
what | previous | This parameter must be one of "previous" or "next" and indicates the relation of the regex to the multi-line record. |
negate | false | This parameter can be true or false. If true, a line not matching the regex constitutes a match of the multiline filter and the previous or next action is applied. The reverse is also true. |
charset | null | The character encoding to use, for example, UTF-8. If none is specified the charset specified in the _attachment_charset input field is used instead. |
Example usage:
# parse log4j with stack traces
readMultiLine {
  regex : "(^.+Exception: .+)|(^\\s+at .+)|(^\\s+\\.\\.\\. \\d+ more)|(^\\s*Caused by:.+)"
  what : previous
  charset : UTF-8
}
# parse sessions; begin new record when we find a line that starts with "Started session"
readMultiLine {
  regex : "Started session.*"
  what : next
  charset : UTF-8
}
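Neither example above uses the negate parameter. The following sketch assumes a hypothetical log format in which every new entry starts with a date such as 2013-07-01; with negate set to true, any line that does not match the regex is treated as part of the previous record.

```hocon
# append every line that does not start with a date to the previous record
readMultiLine {
  regex : "^\\d{4}-\\d{2}-\\d{2} .*"  # assumed timestamp prefix of a new log entry
  negate : true                       # lines NOT matching the regex trigger the action
  what : previous
  charset : UTF-8
}
```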