kite-morphlines-tika-core
This Maven module contains morphline commands for autodetecting MIME types from binary data. It depends on tika-core.
detectMimeType
The detectMimeType command (source code) uses Apache Tika to autodetect the MIME type of the first attachment from the binary data. The detected MIME type is assigned to the _attachment_mimetype field.
The command provides the following configuration options:
Property Name | Default | Description |
---|---|---|
includeDefaultMimeTypes | true | Whether to include the Tika default MIME types file that ships embedded in tika-core.jar (see http://github.com/apache/tika/blob/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml) |
mimeTypesFiles | [] | The relative or absolute path of zero or more Tika custom-mimetypes.xml files to include. |
mimeTypesString | null | The content of an optional custom-mimetypes.xml file embedded directly inside of this morphline configuration file. |
preserveExisting | true | Whether to preserve the _attachment_mimetype field value if one is already present. |
includeMetaData | false | Whether to pass the record fields to Tika to assist in MIME type detection. |
excludeParameters | true | Whether to remove MIME parameters from output MIME type. |
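
As a rough sketch of how the options in the table combine, the fragment below overrides three of the defaults. The values are illustrative assumptions chosen to show each switch, not recommended settings.

```
detectMimeType {
  # Keep the MIME type definitions embedded in tika-core.jar (the default).
  includeDefaultMimeTypes : true

  # Overwrite any _attachment_mimetype value already present on the record.
  preserveExisting : false

  # Pass the record fields to Tika as additional detection hints.
  includeMetaData : true

  # Keep MIME parameters on the detected type instead of removing them.
  excludeParameters : false
}
```

With excludeParameters set to false, a detected type that carries parameters (for example a charset suffix) would be written to _attachment_mimetype with those parameters intact.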
Example usage:

    detectMimeType {
      includeDefaultMimeTypes : false
      #mimeTypesFiles : [src/test/resources/custom-mimetypes.xml]
      mimeTypesString :
        """
        <mime-info>
          <mime-type type="text/space-separated-values">
            <glob pattern="*.ssv"/>
          </mime-type>

          <mime-type type="avro/binary">
            <magic priority="50">
              <match value="0x4f626a01" type="string" offset="0"/>
            </magic>
            <glob pattern="*.avro"/>
          </mime-type>

          <mime-type type="mytwittertest/json+delimited+length">
            <magic priority="50">
              <match value="[0-9]+(\r)?\n\\{&quot;" type="regex" offset="0:16"/>
            </magic>
          </mime-type>
        </mime-info>
        """
    }
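
For context, a detectMimeType command normally sits inside the commands list of a morphline. The sketch below assumes the record already carries its binary payload as an attachment, loads custom definitions from a file instead of an inline string, and then branches on the detected type. The morphline id, the file path, and the choice of follow-up commands are hypothetical. It also assumes the custom file defines avro/binary as in the example usage.

```
morphlines : [
  {
    id : detectAndRoute                      # hypothetical morphline name
    importCommands : ["org.kitesdk.**"]

    commands : [
      {
        detectMimeType {
          includeDefaultMimeTypes : true
          # Hypothetical path to a custom Tika definitions file.
          mimeTypesFiles : ["conf/custom-mimetypes.xml"]
        }
      }
      {
        # Branch on the field that detectMimeType populated.
        if {
          conditions : [
            { equals { _attachment_mimetype : ["avro/binary"] } }
          ]
          then : [
            { logDebug { format : "Avro record: {}", args : ["@{}"] } }
          ]
          else : [
            { dropRecord {} }
          ]
        }
      }
    ]
  }
]
```

Running detection as its own step keeps the routing logic independent of how the MIME type was determined, so later commands only need to inspect _attachment_mimetype.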