kite-morphlines-tika-core

This maven module contains morphline commands for autodetecting MIME types from binary data. Depends on tika-core.

detectMimeType

The detectMimeType command (source code) uses Apache Tika to autodetect the MIME type of the first attachment from the binary data. The detected MIME type is assigned to the _attachment_mimetype field.

The command provides the following configuration options:

Property Name Default Description
includeDefaultMimeTypes true Whether to include the Tika default MIME types file that ships embedded in tika-core.jar (see http://github.com/apache/tika/blob/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml)
mimeTypesFiles [] The relative or absolute path of zero or more Tika custom-mimetypes.xml files to include.
mimeTypesString null The content of an optional custom-mimetypes.xml file embedded directly inside of this morphline configuration file.
preserveExisting true Whether to preserve the _attachment_mimetype field value if one is already present.
includeMetaData false Whether to pass the record fields to Tika to assist in MIME type detection.
excludeParameters true Whether to remove MIME parameters from output MIME type.

Example usage:

detectMimeType {
  includeDefaultMimeTypes : false
  #mimeTypesFiles : [src/test/resources/custom-mimetypes.xml]
  mimeTypesString :
    """
    <mime-info>
      <mime-type type="text/space-separated-values">
        <glob pattern="*.ssv"/>
      </mime-type>

      <mime-type type="avro/binary">
        <magic priority="50">
          <match value="0x4f626a01" type="string" offset="0"/>
        </magic>
        <glob pattern="*.avro"/>
      </mime-type>

      <mime-type type="mytwittertest/json+delimited+length">
        <magic priority="50">
          <match value="[0-9]+(\r)?\n\\{&quot;" type="regex" offset="0:16"/>
        </magic>
      </mime-type>
    </mime-info>
    """
}