kite-morphlines-solr-core
This Maven module contains morphline commands for Solr that higher-level modules such as kite-morphlines-solr-cell, search-mr, and search-flume depend on for indexing.
solrLocator
A solrLocator is a set of configuration parameters that identify the location and schema of a Solr server or SolrCloud. Based on this information, a morphline Solr command can fetch the Solr index schema and send data to Solr. A solrLocator is not itself a command but a parameter shared by many morphline Solr commands, and is therefore described separately here.
Example usage:
solrLocator : {
  # Name of Solr collection
  collection : collection1
  # ZooKeeper ensemble
  zkHost : "127.0.0.1:2181/solr"
  # Max number of documents to pass per RPC from morphline to Solr Server
  # batchSize : 10000
}
loadSolr
The loadSolr command (source code) inserts records into, updates records in, or deletes records from a Solr server or MapReduce Reducer.
The command provides the following configuration options:
| Property Name | Default | Description | 
|---|---|---|
| solrLocator | n/a | Solr location parameters as described separately above. | 
| boosts | [] | An optional JSON object containing zero or more fieldName-boostValue mappings where the fieldName is a String and the boostValue is a float. The default boost is 1.0. | 
Examples:
- loadSolrUpdate
- loadSolrPartialUpdate
- loadSolrDeleteById
- loadSolrDeleteByQuery
- loadSolrChildDocuments

Example loadSolr usage to insert a document or update an existing document stored in Solr ("update"):
loadSolr {
  solrLocator : {
    # Name of Solr collection
    collection : collection1
    # ZooKeeper ensemble
    zkHost : "127.0.0.1:2181/solr"
    # Max number of docs to pass per RPC from morphline to Solr Server
    # batchSize : 10000
  }
  boosts : {
    id : 2.0 # assign a boost value of 2.0 to the id field
  }
}
Example loadSolr usage to update a subset of fields of an existing document stored in Solr ("partial document update"):
java { code : """ 
  // specify the unique key of the document stored in Solr that shall be updated
  record.put("id", 123); 
  // set "first_name" field of stored Solr document to "Nadja"; retain other fields as-is
  record.put("first_name", Collections.singletonMap("set", "Nadja"));
  // set "tags" field of stored Solr document to multiple values ["smart", "creative"]; retain other fields as-is
  record.put("tags", Collections.singletonMap("set", Arrays.asList("smart", "creative")));
  // add "San Francisco" to the existing values of the cities field of the stored Solr document; retain other fields as-is
  record.put("cities", Collections.singletonMap("add", "San Francisco"));
  
  // remove the "text" field from a document stored in Solr; retain other fields as-is
  record.put("text", Collections.singletonMap("set", null));
  
  // increment user_friends_count by 5; retain other fields of stored Solr document as-is
  record.put("user_friends_count", Collections.singletonMap("inc", 5));
  // pass record to next command in chain
  return child.process(record);
"""
}
loadSolr {
  <solrLocator goes here>
}
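To make the effect of the "set", "add", and "inc" modifiers used above more concrete, the following is a minimal sketch that applies them to an in-memory map standing in for the stored Solr document. It is an illustration only, not Solr or Kite code: the class `AtomicUpdateSketch`, its `apply` method, and the sample field values are hypothetical, and treating a missing field as 0 for "inc" is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the modifier semantics; not part of Solr or Kite.
class AtomicUpdateSketch {

    // Applies one {modifier: value} instruction to a single field of the stored document.
    static void apply(Map<String, Object> stored, String field, Map<String, Object> instruction) {
        for (Map.Entry<String, Object> entry : instruction.entrySet()) {
            Object value = entry.getValue();
            switch (entry.getKey()) {
                case "set": {
                    // Setting a field to null removes it; otherwise the value is replaced.
                    if (value == null) {
                        stored.remove(field);
                    } else {
                        stored.put(field, value);
                    }
                    break;
                }
                case "add": {
                    // Append the new value to whatever values the field already holds.
                    List<Object> values = new ArrayList<>();
                    Object existing = stored.get(field);
                    if (existing instanceof List) {
                        values.addAll((List<?>) existing);
                    } else if (existing != null) {
                        values.add(existing);
                    }
                    values.add(value);
                    stored.put(field, values);
                    break;
                }
                case "inc": {
                    // Assumption of this sketch: a missing field counts as 0.
                    Number existing = (Number) stored.getOrDefault(field, 0);
                    stored.put(field, existing.longValue() + ((Number) value).longValue());
                    break;
                }
                default:
                    throw new IllegalArgumentException("Unsupported modifier: " + entry.getKey());
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Object> stored = new LinkedHashMap<>();
        stored.put("id", 123);
        stored.put("cities", Arrays.asList("Berlin")); // hypothetical existing value
        stored.put("text", "obsolete");                // hypothetical existing value
        stored.put("user_friends_count", 10L);         // hypothetical existing value

        apply(stored, "first_name", Collections.singletonMap("set", "Nadja"));
        apply(stored, "cities", Collections.singletonMap("add", "San Francisco"));
        apply(stored, "text", Collections.singletonMap("set", null));
        apply(stored, "user_friends_count", Collections.singletonMap("inc", 5));

        System.out.println(stored);
    }
}
```

Fields that receive no instruction, such as id here, are left untouched, which corresponds to the "retain other fields as-is" behavior described in the comments of the morphline example.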
Example loadSolr usage for deleteById:
# Tell loadSolr command to delete the documents for which the unique key field equals 123 or 456.
setValues {
  _loadSolr_deleteById:[123, 456]
}
loadSolr {
  <solrLocator goes here>
}
Example loadSolr usage for deleteByQuery:
# Tell loadSolr command to delete all documents for which the following conditions hold: 
# The city field starts with "Paris" AND the color field equals "blue" OR
# The city field starts with "London" AND the color field equals "purple"
setValues {
  _loadSolr_deleteByQuery:["(city:Paris*)AND(color:blue)", "(city:London*)AND(color:purple)"]
}
loadSolr {
  <solrLocator goes here>
}
Example loadSolr usage for child documents (aka nested documents):
A record can contain (arbitrarily nested) child documents (aka nested documents aka nested records) in the "_loadSolr_childDocuments" morphline record field. If present, these are recognized and indexed by the loadSolr command, and the parent-child relationships become available to Solr queries, as shown below:
java { 
  code: """
    // Index a document that has a foo child document, which in turn has a bar child document
    record.put("id", "12345");
    record.put("content_type", "parent");
    Record childDoc = new Record();            
    childDoc.put("id", "foo");
    childDoc.put("content_type", "child");
    Record childDoc2 = new Record();
    childDoc2.put("id", "bar");
    childDoc2.put("content_type", "child");
    childDoc.put("_loadSolr_childDocuments", childDoc2); // mark as child doc
    record.put("_loadSolr_childDocuments", childDoc); // mark as child doc
    return child.process(record);
  """
}
loadSolr {
  <solrLocator goes here>
}
Example Solr parent block join that returns the parent records whose child documents contain "bar" in the id field:
{!parent which='content_type:parent'}id:bar
For more background, see this article.
generateSolrSequenceKey
The generateSolrSequenceKey command (source code) assigns each record a unique key formed by concatenating the value of the given baseIdField record field with a running count of the record number within the current session. The count is reset to zero whenever a startSession notification is received.
For example, assume a CSV file containing multiple records but no unique ids, and that the baseIdField holds the filesystem path of the file. This command can then be used to assign the following record values to Solr's unique key field: $path#0, $path#1, ... $path#N.
The name of the unique key field is fetched from Solr's schema.xml file, as directed by the solrLocator configuration parameter.
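The key scheme described above can be sketched in a few lines of plain Java. This is a simplified illustration under stated assumptions, not the Kite implementation: the class `SequenceKeySketch` and its methods are hypothetical names, the path is made up, and the preserveExisting option and the schema lookup are left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the key generation logic; not the Kite source code.
class SequenceKeySketch {
    private long count = 0;

    // Mirrors the effect of a startSession notification: the running count restarts at zero.
    void startSession() {
        count = 0;
    }

    // Returns the base id, a '#' separator, and the running count, then advances the count.
    String nextKey(String baseId) {
        return baseId + "#" + count++;
    }

    public static void main(String[] args) {
        SequenceKeySketch generator = new SequenceKeySketch();
        String path = "/data/input.csv"; // hypothetical value of the base id field

        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            keys.add(generator.nextKey(path));
        }
        System.out.println(keys); // [/data/input.csv#0, /data/input.csv#1, /data/input.csv#2]

        generator.startSession();
        System.out.println(generator.nextKey(path)); // /data/input.csv#0
    }
}
```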
The command provides the following configuration options:
| Property Name | Default | Description | 
|---|---|---|
| solrLocator | n/a | Solr location parameters as described separately above. | 
| baseIdField | baseid | The name of the input field to use for prefixing keys. | 
| preserveExisting | true | Whether to preserve the field value if one is already present. | 
Example usage:
generateSolrSequenceKey {
  baseIdField: ignored_base_id
  solrLocator : ${SOLR_LOCATOR}
}
sanitizeUnknownSolrFields
The sanitizeUnknownSolrFields command (source code) sanitizes record fields that are unknown to Solr's schema.xml. It either deletes them (if the renameToPrefix parameter is absent or a zero-length string) or moves them to a field whose name is prefixed with the given renameToPrefix (for example, to make use of typical dynamic Solr fields).
Recall that Solr throws an exception on any attempt to load a document that contains a field that is not specified in schema.xml.
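The delete-or-rename rule can be illustrated with a small, self-contained sketch. It is a simplification under stated assumptions rather than the command's actual code: the set of known field names is passed in directly instead of being read from schema.xml, dynamic field patterns are not evaluated, and the class `SanitizeFieldsSketch`, the `ignored_` prefix, and the sample fields are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of the sanitizing rule; not the Kite source code.
class SanitizeFieldsSketch {

    // Keeps known fields, and drops or renames unknown ones depending on renameToPrefix.
    static Map<String, Object> sanitize(Map<String, Object> record,
                                        Set<String> knownFields,
                                        String renameToPrefix) {
        Map<String, Object> result = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : record.entrySet()) {
            String name = entry.getKey();
            if (knownFields.contains(name)) {
                result.put(name, entry.getValue());
            } else if (renameToPrefix != null && !renameToPrefix.isEmpty()) {
                // Unknown field: move its value to a prefixed field name.
                result.put(renameToPrefix + name, entry.getValue());
            }
            // Unknown field with an absent or empty prefix: dropped.
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> knownFields = new HashSet<>(Arrays.asList("id", "text"));
        Map<String, Object> record = new LinkedHashMap<>();
        record.put("id", "42");
        record.put("text", "hello");
        record.put("color", "blue"); // not part of the assumed schema

        System.out.println(sanitize(record, knownFields, ""));         // {id=42, text=hello}
        System.out.println(sanitize(record, knownFields, "ignored_")); // {id=42, text=hello, ignored_color=blue}
    }
}
```

Renaming only helps if the Solr schema accepts the prefixed names, for instance through a matching dynamic field definition.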
The command provides the following configuration options:
| Property Name | Default | Description | 
|---|---|---|
| solrLocator | n/a | Solr location parameters as described separately above. | 
| renameToPrefix | "" | Output field prefix for unknown fields. | 
Example usage:
sanitizeUnknownSolrFields {
  solrLocator : ${SOLR_LOCATOR}
}
tokenizeText
The tokenizeText command (source code) uses the embedded Solr/Lucene Analyzer library to generate tokens from a text string, without sending data to a Solr server.
This is useful for prototyping and debugging Solr applications. It is also useful for standalone usage outside of Solr, for example to extract text features from documents for use in recommendation, clustering, and classification applications.
The command provides the following configuration options:
| Property Name | Default | Description | 
|---|---|---|
| solrLocator | n/a | Solr location parameters as described separately above. | 
| inputField | n/a | The name of the input field. | 
| outputField | n/a | The name of the field to add output values to. | 
| solrFieldType | n/a | The name of the Solr field type in schema.xml to use for text analysis and tokenization. This parameter specifies the algorithmic extraction rules. Example: "text_en" | 
Example usage:
tokenizeText {
  inputField : message
  outputField : tokens
  solrFieldType : text_en
  solrLocator : {
    # Name of Solr collection
    collection : collection1
    # ZooKeeper ensemble
    zkHost : "127.0.0.1:2181/solr"
    
    # solrHomeDir : "example/solr/collection1"    
  }
}
For example, given the input field message with the value Hello World!\nFoo@Bar.com #%()123, the expected output record is:
tokens:hello
tokens:world
tokens:foo
tokens:bar.com
tokens:123
This example assumes the Solr field type named "text_en" is defined in schema.xml as shown in the following snippet:
...
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory"
       ignoreCase="true"
       words="lang/stopwords_en.txt"
       enablePositionIncrements="true"
    />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
    