kite-morphlines-solr-core

This maven module contains morphline commands for Solr that higher level modules such as kite-morphlines-solr-cell, search-mr, and search-flume depend on for indexing.

solrLocator

A solrLocator is a set of configuration parameters that identify the location and schema of a Solr server or SolrCloud. Based on this information a morphline Solr command can fetch the Solr index schema and send data to Solr. A solrLocator is not actually a command but rather a common parameter of many morphline Solr commands, and thus described separately here.

Example usage:

solrLocator : {
  # Name of solr collection
  collection : collection1

  # ZooKeeper ensemble
  zkHost : "127.0.0.1:2181/solr"

  # Max number of documents to pass per RPC from morphline to Solr Server
  # batchSize : 10000
}

loadSolr

The loadSolr command (source code) inserts, updates or deletes records into a Solr server or MapReduce Reducer.

The command provides the following configuration options:

Property Name Default Description
solrLocator n/a Solr location parameters as described separately above.
boosts [] An optional JSON object containing zero or more fieldName-boostValue mappings where the fieldName is a String and the boostValue is a float. The default boost is 1.0.

Examples:

Example loadSolr usage to insert a document or update an existing document stored in Solr ("update")

loadSolr {
  solrLocator : {
    # Name of solr collection
    collection : collection1

    # ZooKeeper ensemble
    zkHost : "127.0.0.1:2181/solr"

    # Max number of docs to pass per RPC from morphline to Solr Server
    # batchSize : 10000
  }
  boosts : {
    id : 2.0 # assign to the id field a boost value 2.0
  }
}

Example loadSolr usage to update a subset of fields of an existing document stored in Solr ("partial document update"):

java { code : """ 
  // specify the unique key of the document stored in Solr that shall be updated
  record.put("id", 123); 

  // set "first_name" field of stored Solr document to "Nadja"; retain other fields as-is
  record.put("first_name", Collections.singletonMap("set", "Nadja"));

  // set "tags" field of stored Solr document to multiple values ["smart", "creative"]; retain other fields as-is
  record.put("tags", Collections.singletonMap("set", Arrays.asList("smart", "creative")));

  // add "San Francisco" to the existing values of the cities field of the stored Solr document; retain other fields as-is
  record.put("cities", Collections.singletonMap("add", "San Francisco"));
  
  // remove the "text" field from a document stored in Solr; retain other fields as-is
  record.put("text", Collections.singletonMap("set", null));
  
  // increment user_friends_count by 5; retain other fields of stored Solr document as-is
  record.put("user_friends_count", Collections.singletonMap("inc", 5));

  // pass record to next command in chain
  return child.process(record); 
              """
}

loadSolr {
  <solrLocator goes here>
}

Example loadSolr usage for deleteById:

# Tell loadSolr command to delete the documents for which the unique key field equals 123 or 456.
setValues {
  _loadSolr_deleteById:[123, 456]
}

loadSolr {
  <solrLocator goes here>
}

Example loadSolr usage for deleteByQuery:

# Tell loadSolr command to delete all documents for which the following conditions hold: 
# The city field starts with "Paris" AND the color field equals "blue" OR
# The city field starts with "London" AND the color field equals "purple"
setValues {
  _loadSolr_deleteByQuery:["(city:Paris*)AND(color:blue)", "(city:London*)AND(color:purple)"]
}

loadSolr {
  <solrLocator goes here>
}

Example loadSolr usage for child documents (aka nested documents):

A record can contain (arbitrarily nested) child documents (aka nested documents aka nested records) in the "_loadSolr_childDocuments" morphline record field. If present, these are recognized and indexed by the loadSolr command, and the parent-child relationships become available to Solr queries, as shown below:

java { 
  code: """
    // Index a document that has a foo child document, which in turn has a bar child document
    record.put("id", "12345");
    record.put("content_type", "parent");
    Record childDoc = new Record();            
    childDoc.put("id", "foo");
    childDoc.put("content_type", "child");
    Record childDoc2 = new Record();
    childDoc2.put("id", "bar");
    childDoc2.put("content_type", "child");
    childDoc.put("_loadSolr_childDocuments", childDoc2); // mark as child doc
    record.put("_loadSolr_childDocuments", childDoc); // mark as child doc
    return child.process(record);
        """ 
  } 
}           

loadSolr {
  <solrLocator goes here>
}

Example Solr parent block join that returns the parent records for records where the child documents contain "bar" in the id field:

{!parent which='content_type:parent'}id:bar

For more background see this article.

generateSolrSequenceKey

The generateSolrSequenceKey command (source code) assigns a record unique key that is the concatenation of the given baseIdField record field, followed by a running count of the record number within the current session. The count is reset to zero whenever a startSession notification is received.

For example, assume a CSV file containing multiple records but no unique ids, and the base_id field is the filesystem path of the file. Now this command can be used to assign the following record values to Solr's unique key field: $path#0, $path#1, ... $path#N.

The name of the unique key field is fetched from Solr's schema.xml file, as directed by the solrLocator configuration parameter.

The command provides the following configuration options:

Property Name Default Description
solrLocator n/a Solr location parameters as described separately above.
baseIdField baseid The name of the input field to use for prefixing keys.
preserveExisting true Whether to preserve the field value if one is already present.solrLocator n/a Solr location parameters as described separately above. baseIdField baseid The name of the input field to use for prefixing keys. preserveExisting true Whether to preserve the field value if one is already present.

Example usage:

generateSolrSequenceKey {
  baseIdField: ignored_base_id
  solrLocator : ${SOLR_LOCATOR}
}

sanitizeUnknownSolrFields

The sanitizeUnknownSolrFields command (source code) sanitizes record fields that are unknown to Solr schema.xml by either deleting them (renameToPrefix parameter is absent or a zero length string) or by moving them to a field prefixed with the given renameToPrefix (for example, to use typical dynamic Solr fields).

Recall that Solr throws an exception on any attempt to load a document that contains a field that is not specified in schema.xml.

The command provides the following configuration options:

Property Name Default Description
solrLocator n/a Solr location parameters as described separately above.
renameToPrefix "" Output field prefix for unknown fields.

Example usage:

sanitizeUnknownSolrFields {
  solrLocator : ${SOLR_LOCATOR}
}

tokenizeText

The tokenizeText command (source code) uses the embedded Solr/Lucene Analyzer library to generate tokens from a text string, without sending data to a Solr server.

This is useful for prototyping and debugging Solr applications. It is also useful for standalone usage outside of Solr, e.g. for extracting text features from documents for use with recommendation systems, clustering and classification applications.

The command provides the following configuration options:

Property Name Default Description
solrLocator n/a Solr location parameters as described separately above.
inputField n/a The name of the input field.
outputField n/a The name of the field to add output values to.
solrFieldType n/a The name of the Solr field type in schema.xml to use for text analysis and tokenization. This parameter specifies the algorithmic extraction rules. Example: "text_en"

Example usage:

tokenizeText {
  inputField : message
  outputField : tokens
  solrFieldType : text_en
  solrLocator : {
    # Name of solr collection
    collection : collection1

    # ZooKeeper ensemble
    zkHost : "127.0.0.1:2181/solr"
    
    # solrHomeDir : "example/solr/collection1"    
  }
}

For example, given the input field message with the value Hello World!\nFoo@Bar.com #%()123 the expected output record is:

tokens:hello
tokens:world
tokens:foo
tokens:bar.com
tokens:123

This example assumes the Solr field type named "text_en" is defined in schema.xml as shown in the following snippet:

...
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory"
       ignoreCase="true"
       words="lang/stopwords_en.txt"
       enablePositionIncrements="true"
    />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>