kite-morphlines-tika-core
This Maven module contains morphline commands for autodetecting MIME types from binary data. It depends on tika-core.
detectMimeType
The detectMimeType command uses Apache Tika to autodetect the
MIME type of the first attachment from its binary data. The
detected MIME type is assigned to the _attachment_mimetype field.
The command provides the following configuration options:
| Property Name | Default | Description |
|---|---|---|
| includeDefaultMimeTypes | true | Whether to include the Tika default MIME types file that ships embedded in tika-core.jar (see http://github.com/apache/tika/blob/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml) |
| mimeTypesFiles | [] | The relative or absolute path of zero or more Tika custom-mimetypes.xml files to include. |
| mimeTypesString | null | The content of an optional custom-mimetypes.xml file embedded directly inside of this morphline configuration file. |
| preserveExisting | true | Whether to preserve the _attachment_mimetype field value if one is already present. |
| includeMetaData | false | Whether to pass the record fields to Tika to assist in MIME type detection. |
| excludeParameters | true | Whether to remove MIME parameters from the output MIME type. |
Example usage:
detectMimeType {
  includeDefaultMimeTypes : false
  #mimeTypesFiles : [src/test/resources/custom-mimetypes.xml]
  mimeTypesString :
    """
    <mime-info>
      <mime-type type="text/space-separated-values">
        <glob pattern="*.ssv"/>
      </mime-type>

      <mime-type type="avro/binary">
        <magic priority="50">
          <match value="0x4f626a01" type="string" offset="0"/>
        </magic>
        <glob pattern="*.avro"/>
      </mime-type>

      <mime-type type="mytwittertest/json+delimited+length">
        <magic priority="50">
          <match value="[0-9]+(\r)?\n\\{&quot;" type="regex" offset="0:16"/>
        </magic>
      </mime-type>
    </mime-info>
    """
}
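
To show how the detected value can be used downstream, here is a minimal sketch of a complete morphline that runs detectMimeType and then branches on the _attachment_mimetype field. It is an illustration under stated assumptions, not a configuration taken from this module: the morphline id and the file path conf/custom-mimetypes.xml are hypothetical, the custom file is assumed to define avro/binary as in the example above, and the if, equals, logInfo and logDebug commands are assumed to be available from kite-morphlines-core.

```
morphlines : [
  {
    id : detectAndRoute                  # hypothetical morphline name
    importCommands : ["org.kitesdk.**"]

    commands : [
      {
        detectMimeType {
          includeDefaultMimeTypes : false
          # Hypothetical path; point this at your own custom-mimetypes.xml
          mimeTypesFiles : ["conf/custom-mimetypes.xml"]
          preserveExisting : false       # overwrite any existing _attachment_mimetype value
          excludeParameters : true       # remove MIME parameters from the output MIME type
        }
      }

      {
        if {
          conditions : [
            # Assumes the custom file defines avro/binary, as in the example above
            { equals { _attachment_mimetype : ["avro/binary"] } }
          ]
          then : [
            { logInfo { format : "Avro attachment: {}", args : ["@{}"] } }
          ]
          else : [
            { logDebug { format : "Other MIME type: {}", args : ["@{_attachment_mimetype}"] } }
          ]
        }
      }
    ]
  }
]
```

Setting preserveExisting to false makes the command recompute the type even when an upstream command has already populated _attachment_mimetype; leave it at the default of true if earlier stages are trusted to set the field.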
