kite-morphlines-solr-cell
This Maven module contains morphline commands for using SolrCell with Tika parsers. Supported types include HTML, XML, PDF, Word, Excel, images, audio, and video.
solrCell
The solrCell command pipes the first attachment of a record into one of the given Apache Tika parsers, then maps the Tika output back to a record using Apache SolrCell.
The Tika parser is chosen from the configurable list of parsers, depending on the MIME type specified in the input record. Typically, this requires an upstream detectMimeType command.
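As a minimal sketch of that ordering, the fragment below places detectMimeType directly before solrCell inside a morphline's commands list. The choice of HtmlParser is only an illustration, and ${SOLR_LOCATOR} is assumed to be defined elsewhere in the configuration file.

```hocon
commands : [
  {
    # fills in the record's MIME type when the caller did not supply one
    detectMimeType {
      includeDefaultMimeTypes : true
    }
  }
  {
    # selects a parser from the list below based on the record's MIME type
    solrCell {
      solrLocator : ${SOLR_LOCATOR}
      parsers : [
        { parser : org.apache.tika.parser.html.HtmlParser }
      ]
    }
  }
]
```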
The command provides the following configuration options:
Property Name | Default | Description |
---|---|---|
solrLocator | n/a | Solr location parameters as described separately above. |
capture | [] | List of XHTML element names to extract from the Tika output. For instance, it could be used to grab paragraphs (<p>) and index them into a separate field. Note that content is also still captured into the overall "content" field. |
fmap | [] | Maps (moves) one field name to another. See the example below. |
uprefix | null | Prefix to apply to all fields that are not defined in the Solr schema.xml. Recall that Solr throws an exception on any attempt to load a document that contains a field that is not specified in schema.xml. The uprefix option is very useful when combined with dynamic field definitions. For example, uprefix : ignored_ would effectively ignore all unknown fields generated by Tika if the schema.xml contains the following dynamic field definition: <dynamicField name="ignored_*" type="ignored"/> |
captureAttr | false | Whether to index attributes of the Tika XHTML elements into separate fields, named after the element. For example, when extracting from HTML, Tika can return the href attributes in <a> tags as fields named "a". |
xpath | null | When extracting, only return Tika XHTML content that satisfies the XPath expression. See http://tika.apache.org/1.4/parser.html for details on the format of Tika XHTML. See also http://wiki.apache.org/solr/TikaExtractOnlyExampleOutput. |
lowernames | false | Map all field names to lowercase with underscores. For example, Content-Type would be mapped to content_type. |
solrContentHandlerFactory | org.kitesdk.morphline.solrcell.TrimSolrContentHandlerFactory | A Java class to handle bridging from Tika to SolrCell. |
parsers | [] | List of fully qualified Java class names of one or more Tika parsers. |
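
To make the table more concrete, here is a sketch that combines several of these options for HTML input. The parser choice and the particular capture and fmap values are illustrative assumptions, and the uprefix setting only has the described effect if schema.xml declares a matching dynamic field.

```hocon
solrCell {
  solrLocator : ${SOLR_LOCATOR}

  # copy <h1> text into its own field, in addition to "content"
  capture : [h1]

  # also index element attributes, e.g. href values of <a> tags as field "a"
  captureAttr : true

  # Content-Type becomes content_type
  lowernames : true

  # move the captured h1 field to heading1
  fmap : { h1 : heading1 }

  # fields unknown to schema.xml get the prefix "ignored_";
  # assumes schema.xml declares a dynamic field matching "ignored_*"
  uprefix : ignored_

  parsers : [
    { parser : org.apache.tika.parser.html.HtmlParser }
  ]
}
```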
Example usage:
solrCell {
  solrLocator : ${SOLR_LOCATOR}
  # extract some fields
  capture : [content, a, h1, h2]
  # rename exif_image_height field to text field
  # rename a field to anchor field
  # rename h1 field to heading1 field
  fmap : { exif_image_height : text, a : anchor, h1 : heading1 }
  # xpath : "/xhtml:html/xhtml:body/xhtml:div/descendant:node()"
  parsers : [ # one or more nested Tika parsers
    { parser : org.apache.tika.parser.jpeg.JpegParser }
  ]
}
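
The example refers to ${SOLR_LOCATOR} without showing a definition. The following is a sketch of one for a SolrCloud setup. The parameter names collection and zkHost are assumptions taken from the general solrLocator documentation rather than from this section, and both values are placeholders.

```hocon
# assumed top-level definition, placed before the morphlines list
SOLR_LOCATOR : {
  # name of the Solr collection (placeholder value)
  collection : collection1
  # ZooKeeper ensemble of the SolrCloud cluster (placeholder value)
  zkHost : "127.0.0.1:2181/solr"
}
```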
Here is a complex morphline that demonstrates integrating multiple heterogeneous input file formats via a tryRules command, including Avro and SolrCell, using auto-detection of MIME types via the detectMimeType command, recursion via the callParentPipe command for unwrapping container formats, and automatic UUID generation:
morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.**", "org.apache.solr.**"]
    commands : [
      {
        # emit one output record for each attachment in the input
        # record's list of attachments. The result is a list of
        # records, each of which has at most one attachment.
        separateAttachments {}
      }
      {
        # used for auto-detection if MIME type isn't explicitly supplied
        detectMimeType {
          includeDefaultMimeTypes : true
          mimeTypesFiles : [target/test-classes/custom-mimetypes.xml]
        }
      }
      {
        tryRules {
          throwExceptionIfAllRulesFailed : true
          rules : [
            # first rule of tryRules cmd:
            {
              commands : [
                { logDebug { format : "hello unpack" } }
                { unpack {} }
                { generateUUID {} }
                { callParentPipe {} }
              ]
            }
            # next rule of tryRules cmd:
            {
              commands : [
                { logDebug { format : "hello decompress" } }
                { decompress {} }
                { callParentPipe {} }
              ]
            }
            # next rule of tryRules cmd:
            {
              commands : [
                {
                  readAvroContainer {
                    supportedMimeTypes : [avro/binary]
                    # optional, avro json schema blurb for getSchema()
                    # readerSchemaString : "<json can go here>"
                    # readerSchemaFile : /path/to/syslog.avsc
                  }
                }
                { extractAvroTree {} }
                {
                  setValues {
                    id : "@{/id}"
                    user_screen_name : "@{/user_screen_name}"
                    text : "@{/text}"
                  }
                }
                {
                  sanitizeUnknownSolrFields {
                    solrLocator : ${SOLR_LOCATOR}
                  }
                }
              ]
            }
            # next rule of tryRules cmd:
            {
              commands : [
                {
                  readJsonTestTweets {
                    supportedMimeTypes : ["mytwittertest/json+delimited+length"]
                  }
                }
                {
                  sanitizeUnknownSolrFields {
                    solrLocator : ${SOLR_LOCATOR}
                  }
                }
              ]
            }
            # next rule of tryRules cmd:
            {
              commands : [
                { logDebug { format : "hello solrcell" } }
                {
                  # wrap SolrCell around one or more Tika parsers
                  solrCell {
                    solrLocator : ${SOLR_LOCATOR}
                    capture : [
                      # twitter feed schema
                      user_friends_count
                      user_location
                      user_description
                      user_statuses_count
                      user_followers_count
                      user_name
                      user_screen_name
                      created_at
                      text
                      retweet_count
                      retweeted
                      in_reply_to_user_id
                      source
                      in_reply_to_status_id
                      media_url_https
                      expanded_url
                    ]
                    # rename "content" field to "text" field
                    fmap : { content : text, content-type : content_type }
                    lowernames : true
                    # Tika parsers to be registered:
                    parsers : [
                      # { parser : org.apache.tika.parser.AutoDetectParser }
                      { parser : org.apache.tika.parser.asm.ClassParser }
                      { parser : org.gagravarr.tika.FlacParser }
                      { parser : org.apache.tika.parser.audio.AudioParser }
                      { parser : org.apache.tika.parser.audio.MidiParser }
                      { parser : org.apache.tika.parser.crypto.Pkcs7Parser }
                      { parser : org.apache.tika.parser.dwg.DWGParser }
                      { parser : org.apache.tika.parser.epub.EpubParser }
                      { parser : org.apache.tika.parser.executable.ExecutableParser }
                      { parser : org.apache.tika.parser.feed.FeedParser }
                      { parser : org.apache.tika.parser.font.AdobeFontMetricParser }
                      { parser : org.apache.tika.parser.font.TrueTypeParser }
                      { parser : org.apache.tika.parser.xml.XMLParser }
                      { parser : org.apache.tika.parser.html.HtmlParser }
                      { parser : org.apache.tika.parser.image.ImageParser }
                      { parser : org.apache.tika.parser.image.PSDParser }
                      { parser : org.apache.tika.parser.image.TiffParser }
                      { parser : org.apache.tika.parser.iptc.IptcAnpaParser }
                      { parser : org.apache.tika.parser.iwork.IWorkPackageParser }
                      { parser : org.apache.tika.parser.jpeg.JpegParser }
                      { parser : org.apache.tika.parser.mail.RFC822Parser }
                      { parser : org.apache.tika.parser.mbox.MboxParser,
                        additionalSupportedMimeTypes : [message/x-emlx] }
                      { parser : org.apache.tika.parser.microsoft.OfficeParser }
                      { parser : org.apache.tika.parser.microsoft.TNEFParser }
                      { parser : org.apache.tika.parser.microsoft.ooxml.OOXMLParser }
                      { parser : org.apache.tika.parser.mp3.Mp3Parser }
                      { parser : org.apache.tika.parser.mp4.MP4Parser }
                      { parser : org.apache.tika.parser.hdf.HDFParser }
                      { parser : org.apache.tika.parser.netcdf.NetCDFParser }
                      { parser : org.apache.tika.parser.odf.OpenDocumentParser }
                      { parser : org.apache.tika.parser.pdf.PDFParser }
                      { parser : org.apache.tika.parser.pkg.CompressorParser }
                      { parser : org.apache.tika.parser.pkg.PackageParser }
                      { parser : org.apache.tika.parser.rtf.RTFParser }
                      { parser : org.apache.tika.parser.txt.TXTParser }
                      { parser : org.apache.tika.parser.video.FLVParser }
                      { parser : org.apache.tika.parser.xml.DcXMLParser }
                      { parser : org.apache.tika.parser.xml.FictionBookParser }
                      { parser : org.apache.tika.parser.chm.ChmParser }
                    ]
                  }
                }
                { generateUUID { field : ignored_base_id } }
                {
                  generateSolrSequenceKey {
                    baseIdField : ignored_base_id
                    solrLocator : ${SOLR_LOCATOR}
                  }
                }
              ]
            }
          ]
        }
      }
      {
        loadSolr {
          solrLocator : ${SOLR_LOCATOR}
        }
      }
      {
        logDebug {
          format : "My output record: {}"
          args : ["@{}"]
        }
      }
    ]
  }
]
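
The recursion pattern in the morphline above can be easier to see with most rules stripped away. The following is a reduced sketch, not a replacement for the full configuration: it keeps one container rule and one SolrCell rule, and the single PDFParser entry is an arbitrary illustrative choice.

```hocon
tryRules {
  throwExceptionIfAllRulesFailed : true
  rules : [
    # container rule: unwrap the archive, then hand each extracted
    # entry back to the enclosing pipe so it is processed from the top
    {
      commands : [
        { unpack {} }
        { callParentPipe {} }
      ]
    }
    # leaf rule: tried when the rule above fails; parses the
    # attachment with a Tika parser via solrCell
    {
      commands : [
        {
          solrCell {
            solrLocator : ${SOLR_LOCATOR}
            parsers : [
              { parser : org.apache.tika.parser.pdf.PDFParser }
            ]
          }
        }
      ]
    }
  ]
}
```

Because tryRules moves on to the next rule when one fails, the order of the rules decides which handler gets the first chance at a record, which is why the container rules come before the SolrCell rule in the full example.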
Note: More information on SolrCell can be found here: http://wiki.apache.org/solr/ExtractingRequestHandler