kite-morphlines-tika-core
This Maven module contains morphline commands for autodetecting MIME types from binary data. It depends on tika-core.
detectMimeType
The detectMimeType command (source code) uses Apache Tika to autodetect the MIME type of the first attachment from the binary data. The detected MIME type is assigned to the _attachment_mimetype field.
The command provides the following configuration options:
Property Name | Default | Description |
---|---|---|
includeDefaultMimeTypes | true | Whether to include the Tika default MIME types file that ships embedded in tika-core.jar (see http://github.com/apache/tika/blob/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml) |
mimeTypesFiles | [] | The relative or absolute path of zero or more Tika custom-mimetypes.xml files to include. |
mimeTypesString | null | The content of an optional custom-mimetypes.xml file embedded directly inside of this morphline configuration file. |
preserveExisting | true | Whether to preserve the _attachment_mimetype field value if one is already present. |
includeMetaData | false | Whether to pass the record fields to Tika to assist in MIME type detection. |
excludeParameters | true | Whether to remove MIME parameters from the output MIME type. |
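As a minimal sketch of how these options combine, the fragment below keeps the Tika defaults, layers one external definitions file on top, and overwrites any MIME type already present on the record. The file path is a hypothetical placeholder, not a file that ships with this module.

```
detectMimeType {
  # keep the definitions embedded in tika-core.jar
  includeDefaultMimeTypes : true

  # hypothetical path; point this at your own custom-mimetypes.xml
  mimeTypesFiles : [conf/custom-mimetypes.xml]

  # replace an existing _attachment_mimetype value instead of preserving it
  preserveExisting : false

  # hand the record fields to Tika as detection hints
  includeMetaData : true

  # strip MIME parameters from the detected type
  excludeParameters : true
}
```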
Example usage:
detectMimeType {
  includeDefaultMimeTypes : false
  #mimeTypesFiles : [src/test/resources/custom-mimetypes.xml]
  mimeTypesString :
    """
    <mime-info>
      <mime-type type="text/space-separated-values">
        <glob pattern="*.ssv"/>
      </mime-type>

      <mime-type type="avro/binary">
        <magic priority="50">
          <match value="0x4f626a01" type="string" offset="0"/>
        </magic>
        <glob pattern="*.avro"/>
      </mime-type>

      <mime-type type="mytwittertest/json+delimited+length">
        <magic priority="50">
          <match value="[0-9]+(\r)?\n\\{&quot;" type="regex" offset="0:16"/>
        </magic>
      </mime-type>
    </mime-info>
    """
}
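For context, the command normally appears as one step in a morphline's command chain. The sketch below assumes that the incoming record already carries the binary attachment, and that the standard Kite `importCommands` pattern and the `logDebug` command are available. The morphline id is an arbitrary illustrative name.

```
morphlines : [
  {
    # illustrative id; any unique name works
    id : detectMimeTypeExample

    # assumed: makes the Kite morphline commands visible to this morphline
    importCommands : ["org.kitesdk.**"]

    commands : [
      # sets _attachment_mimetype using the Tika default definitions
      { detectMimeType { includeDefaultMimeTypes : true } }

      # log the whole record so the detected type can be inspected
      { logDebug { format : "record after detection: {}", args : ["@{}"] } }
    ]
  }
]
```

Downstream commands can then branch on the value of the _attachment_mimetype field.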