kite-morphlines-solr-cell

This Maven module contains morphline commands for using SolrCell with Tika parsers. Supported formats include HTML, XML, PDF, Word, Excel, images, audio, and video.

solrCell

The solrCell command pipes the first attachment of a record into one of the given Apache Tika parsers, then maps the Tika output back to a record using Apache SolrCell.

The Tika parser is chosen from the configurable list of parsers, depending on the MIME type specified in the input record. Typically, this requires an upstream detectMimeType command.
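
As a minimal sketch of that ordering, the fragment below places detectMimeType ahead of solrCell inside a morphline's commands list. It reuses only options that appear elsewhere on this page; the choice of two parsers is purely illustrative, and ${SOLR_LOCATOR} is assumed to be defined elsewhere in the configuration.

```hocon
commands : [
  {
    # upstream step: detects the MIME type of the record's attachment
    # so that solrCell can pick a matching parser
    detectMimeType {
      includeDefaultMimeTypes : true
    }
  }

  {
    solrCell {
      solrLocator : ${SOLR_LOCATOR}

      # candidate parsers (illustrative selection); the parser is chosen
      # from this list based on the MIME type of the input record
      parsers : [
        { parser : org.apache.tika.parser.html.HtmlParser }
        { parser : org.apache.tika.parser.pdf.PDFParser }
      ]
    }
  }
]
```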

The command provides the following configuration options:

Property Name Default Description
solrLocator n/a Solr location parameters as described separately above.
capture [] List of XHTML element names to extract from the Tika output. For instance, it could be used to grab paragraphs (<p>) and index them into a separate field. Note that content is also still captured into the overall "content" field.
fmap [] Maps (moves) one field name to another. See the example below.
uprefix null Prefix to apply to all fields that are not defined in the Solr schema.xml. Recall that Solr throws an exception on any attempt to load a document that contains a field that is not specified in schema.xml. This option is most useful when combined with dynamic field definitions. For example, uprefix : ignored_ would effectively ignore all unknown fields generated by Tika if the schema.xml contains the following dynamic field definition: <dynamicField name="ignored_*" type="ignored"/>
captureAttr false Whether to index attributes of the Tika XHTML elements into separate fields, named after the element. For example, when extracting from HTML, Tika can return the href attributes in <a> tags as fields named "a".
xpath null When extracting, only return Tika XHTML content that satisfies the XPath expression. See http://tika.apache.org/1.4/parser.html for details on the format of Tika XHTML. See also http://wiki.apache.org/solr/TikaExtractOnlyExampleOutput.
lowernames false Map all field names to lowercase with underscores. For example, Content-Type would be mapped to content_type.
solrContentHandlerFactory org.kitesdk.morphline.solrcell.TrimSolrContentHandlerFactory A Java class to handle bridging from Tika to SolrCell.
parsers [] List of fully qualified Java class names of one or more Tika parsers.

Example usage:

solrCell {
  solrLocator : ${SOLR_LOCATOR}

  # extract some fields
  capture : [content, a, h1, h2]

  # rename exif_image_height field to text field
  # rename a field to anchor field
  # rename h1 field to heading1 field
  fmap : { exif_image_height : text, a : anchor, h1 : heading1 }

  # xpath : "/xhtml:html/xhtml:body/xhtml:div/descendant:node()"

  parsers : [ # one or more nested Tika parsers
    { parser : org.apache.tika.parser.jpeg.JpegParser }
  ]
}
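
The options not exercised above can be combined in the same way. The following fragment is a sketch only: it assumes that schema.xml declares a dynamic field matching ignored_* (as described for uprefix in the table) and that a single HTML parser is sufficient for the input.

```hocon
solrCell {
  solrLocator : ${SOLR_LOCATOR}

  # index attributes of Tika XHTML elements into fields named after
  # the element, e.g. href attributes of <a> tags into a field named "a"
  captureAttr : true

  # map field names to lowercase with underscores,
  # e.g. Content-Type becomes content_type
  lowernames : true

  # prefix fields that are not defined in schema.xml with "ignored_";
  # assumes schema.xml contains a dynamicField definition for "ignored_*"
  uprefix : ignored_

  parsers : [
    { parser : org.apache.tika.parser.html.HtmlParser }
  ]
}
```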

Here is a complex morphline that demonstrates integrating multiple heterogeneous input file formats, including Avro and SolrCell, via a tryRules command. It uses the detectMimeType command to auto-detect MIME types, the callParentPipe command to recurse when unwrapping container formats, and automatic UUID generation:

morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.**", "org.apache.solr.**"]

    commands : [
      {
        # emit one output record for each attachment in the input
        # record's list of attachments. The result is a list of
        # records, each of which has at most one attachment.
        separateAttachments {}
      }

      {
        # used for auto-detection if MIME type isn't explicitly supplied
        detectMimeType {
          includeDefaultMimeTypes : true
          mimeTypesFiles : [target/test-classes/custom-mimetypes.xml]
        }
      }

      {
        tryRules {
          throwExceptionIfAllRulesFailed : true
          rules : [
            # first rule of tryRules cmd:
            {
              commands : [
                { logDebug { format : "hello unpack" } }
                { unpack {} }
                { generateUUID {} }
                { callParentPipe {} }
              ]
            }

            # next rule of tryRules cmd:
            {
              commands : [
                { logDebug { format : "hello decompress" } }
                { decompress {} }
                { callParentPipe {} }
              ]
            }

            # next rule of tryRules cmd:
            {
              commands : [
                {
                  readAvroContainer {
                    supportedMimeTypes : [avro/binary]
                    # optional, avro json schema blurb for getSchema()
                    # readerSchemaString : "<json can go here>"
                    # readerSchemaFile : /path/to/syslog.avsc
                  }
                }

                { extractAvroTree {} }

                {
                  setValues {
                    id : "@{/id}"
                    user_screen_name : "@{/user_screen_name}"
                    text : "@{/text}"
                  }
                }

                {
                  sanitizeUnknownSolrFields {
                    solrLocator : ${SOLR_LOCATOR}
                  }
                }
              ]
            }

            # next rule of tryRules cmd:
            {
              commands : [
                {
                  readJsonTestTweets {
                    supportedMimeTypes : ["mytwittertest/json+delimited+length"]
                  }
                }

                {
                  sanitizeUnknownSolrFields {
                    solrLocator : ${SOLR_LOCATOR}
                  }
                }
              ]
            }

            # next rule of tryRules cmd:
            {
              commands : [
                { logDebug { format : "hello solrcell" } }
                {
                  # wrap SolrCell around one or more Tika parsers
                  solrCell {
                    solrLocator : ${SOLR_LOCATOR}

                    capture : [
                      # twitter feed schema
                      user_friends_count
                      user_location
                      user_description
                      user_statuses_count
                      user_followers_count
                      user_name
                      user_screen_name
                      created_at
                      text
                      retweet_count
                      retweeted
                      in_reply_to_user_id
                      source
                      in_reply_to_status_id
                      media_url_https
                      expanded_url
                    ]

                    # rename "content" field to "text" field
                    # rename "content-type" field to "content_type" field
                    fmap : { content : text, content-type : content_type }

                    lowernames : true

                    # Tika parsers to be registered:
                    parsers : [
                      # { parser : org.apache.tika.parser.AutoDetectParser }
                      { parser : org.apache.tika.parser.asm.ClassParser }
                      { parser : org.gagravarr.tika.FlacParser }
                      { parser : org.apache.tika.parser.audio.AudioParser }
                      { parser : org.apache.tika.parser.audio.MidiParser }
                      { parser : org.apache.tika.parser.crypto.Pkcs7Parser }
                      { parser : org.apache.tika.parser.dwg.DWGParser }
                      { parser : org.apache.tika.parser.epub.EpubParser }
                      { parser : org.apache.tika.parser.executable.ExecutableParser }
                      { parser : org.apache.tika.parser.feed.FeedParser }
                      { parser : org.apache.tika.parser.font.AdobeFontMetricParser }
                      { parser : org.apache.tika.parser.font.TrueTypeParser }
                      { parser : org.apache.tika.parser.xml.XMLParser }
                      { parser : org.apache.tika.parser.html.HtmlParser }
                      { parser : org.apache.tika.parser.image.ImageParser }
                      { parser : org.apache.tika.parser.image.PSDParser }
                      { parser : org.apache.tika.parser.image.TiffParser }
                      { parser : org.apache.tika.parser.iptc.IptcAnpaParser }
                      { parser : org.apache.tika.parser.iwork.IWorkPackageParser }
                      { parser : org.apache.tika.parser.jpeg.JpegParser }
                      { parser : org.apache.tika.parser.mail.RFC822Parser }
                      { parser : org.apache.tika.parser.mbox.MboxParser,
                          additionalSupportedMimeTypes : [message/x-emlx] }
                      { parser : org.apache.tika.parser.microsoft.OfficeParser }
                      { parser : org.apache.tika.parser.microsoft.TNEFParser }
                      { parser : org.apache.tika.parser.microsoft.ooxml.OOXMLParser }
                      { parser : org.apache.tika.parser.mp3.Mp3Parser }
                      { parser : org.apache.tika.parser.mp4.MP4Parser }
                      { parser : org.apache.tika.parser.hdf.HDFParser }
                      { parser : org.apache.tika.parser.netcdf.NetCDFParser }
                      { parser : org.apache.tika.parser.odf.OpenDocumentParser }
                      { parser : org.apache.tika.parser.pdf.PDFParser }
                      { parser : org.apache.tika.parser.pkg.CompressorParser }
                      { parser : org.apache.tika.parser.pkg.PackageParser }
                      { parser : org.apache.tika.parser.rtf.RTFParser }
                      { parser : org.apache.tika.parser.txt.TXTParser }
                      { parser : org.apache.tika.parser.video.FLVParser }
                      { parser : org.apache.tika.parser.xml.DcXMLParser }
                      { parser : org.apache.tika.parser.xml.FictionBookParser }
                      { parser : org.apache.tika.parser.chm.ChmParser }
                    ]
                  }
                }

                { generateUUID { field : ignored_base_id } }

                {
                  generateSolrSequenceKey {
                    baseIdField : ignored_base_id
                    solrLocator : ${SOLR_LOCATOR}
                  }
                }

              ]
            }
          ]
        }
      }

      {
        loadSolr {
          solrLocator : ${SOLR_LOCATOR}
        }
      }

      {
        logDebug {
          format : "My output record: {}"
          args : ["@{}"]
        }
      }

    ]
  }
]

Note: More information on SolrCell can be found here: http://wiki.apache.org/solr/ExtractingRequestHandler