Using the Lily HBase NRT Indexer Service

To index for column families of tables in an HBase cluster:

  • Enable replication on HBase column families
  • Create collections and configurations
  • Register a Lily HBase Indexer configuration with the Lily HBase Indexer Service
  • Verify that indexing is working

Enabling Replication on HBase Column Families

Ensure that cluster-wide HBase replication is enabled, as described in Enabling Cluster-wide HBase Replication. Use the HBase shell to define column-family replication settings.

For every existing table, set the REPLICATION_SCOPE on every column family that you want to index:

hbase shell
hbase shell> disable 'sample_table'
hbase shell> alter 'sample_table', {NAME => 'columnfamily1', REPLICATION_SCOPE => 1}
hbase shell> enable 'sample_table'

For every new table, set the REPLICATION_SCOPE on every column family that you want to index using a command such as the following:

hbase shell
hbase shell> create 'test_table', {NAME => 'testcolumnfamily', REPLICATION_SCOPE => 1}

Creating a Collection in Cloudera Search

A collection in Search used for HBase indexing must have a Solr schema that accommodates the types of HBase column families and qualifiers that are being indexed. To begin, consider adding the all-inclusive data field to a default schema. Once you decide on a schema, create a collection using commands similar to the following:

solrctl instancedir --generate $HOME/hbase_collection_config
## Edit $HOME/hbase_collection_config/conf/schema.xml as needed ##
## If you are using Sentry for authorization, copy solrconfig.xml.secure to solrconfig.xml as follows: ##
## cp $HOME/hbase_collection_config/conf/solrconfig.xml.secure $HOME/hbase_collection_config/conf/solrconfig.xml ##
solrctl instancedir --create hbase_collection_config $HOME/hbase_collection_config
solrctl collection --create hbase_collection -s <numShards> -c hbase_collection_config

Creating a Lily HBase Indexer Configuration File

Configure individual Lily HBase Indexers using the hbase-indexer command-line utility. Typically, there is one Lily HBase Indexer configuration file for each HBase table, but there can be as many Lily HBase Indexer configuration files as there are tables, column families, and corresponding collections in Search. Each Lily HBase Indexer configuration is defined in an XML file, such as morphline-hbase-mapper.xml.

An indexer configuration XML file must refer to the MorphlineResultToSolrMapper implementation and point to the location of a Morphline configuration file, as shown in the following morphline-hbase-mapper.xml indexer configuration file. For Cloudera Manager managed environments, set morphlineFile to the relative path morphlines.conf. For unmanaged environments, specify the absolute path to a morphlines.conf that exists on the Lily HBase Indexer host. Make sure the file is readable by the HBase system user (hbase by default).

$ cat $HOME/morphline-hbase-mapper.xml

<?xml version="1.0"?>
<indexer table="sample_table"
mapper="com.ngdata.hbaseindexer.morphline.MorphlineResultToSolrMapper">

   <!-- The relative or absolute path on the local file system to the
   morphline configuration file. -->
   <!-- Use relative path "morphlines.conf" for morphlines managed by
   Cloudera Manager -->
   <param name="morphlineFile" value="/path/to/morphlines.conf"/>

   <!-- The optional morphlineId identifies a morphline if there are multiple
   morphlines in morphlines.conf -->
   <!-- <param name="morphlineId" value="morphline1"/> -->

</indexer>

The Lily HBase Indexer configuration file also supports the standard attributes of any HBase Lily Indexer on the top-level <indexer> element: table, mapping-type, read-row, mapper, unique-key-formatter, unique-key-field, row-field, column-family-field, andtable-family-field. It does not support the <field> element and <extract> elements.

Creating a Morphline Configuration File

After creating an indexer configuration XML file, you can configure morphline ETL transformation commands in a morphlines.conf configuration file. The morphlines.conf configuration file can contain any number of morphline commands. Typically, an extractHBaseCells command is the first command. The readAvroContainer or readAvro morphline commands are often used to extract Avro data from the HBase byte array. This configuration file can be shared among different applications that use morphlines.

If you are using Cloudera Manager, the morphlines.conf file is edited within Cloudera Manager (Key-Value Store Indexer service > Configuration > Category > Morphlines > Morphlines File).

For unmanaged environments, create the morphlines.conf file on the Lily HBase Indexer host:

cat /etc/hbase-solr/conf/morphlines.conf

morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.morphline.**", "com.ngdata.**"]

    commands : [
      {
        extractHBaseCells {
          mappings : [
            {
              inputColumn : "data:*"
              outputField : "data"
              type : string
              source : value
            }

            #{
            #  inputColumn : "data:item"
            #  outputField : "_attachment_body"
            #  type : "byte[]"
            #  source : value
            #}
          ]
        }
      }

      #for avro use with type : "byte[]" in extractHBaseCells mapping above
      #{ readAvroContainer {} }
      #{
      #  extractAvroPaths {
      #    paths : {
      #      data : /user_name
      #    }
      #  }
      #}

      { logTrace { format : "output record: {}", args : ["@{}"] } }
    ]
  }
]

Understanding the extractHBaseCells Morphline Command

The extractHBaseCells morphline command extracts cells from an HBase result and transforms the values into a SolrInputDocument. The command consists of an array of zero or more mapping specifications.

Each mapping has:

  • The inputColumn parameter, which specifies the data from HBase for populating a field in Solr. It has the form of a column family name and qualifier, separated by a colon. The qualifier portion can end in an asterisk, which is interpreted as a wildcard. In this case, all matching column-family and qualifier expressions are used. The following are examples of valid inputColumn values:
    • mycolumnfamily:myqualifier
    • mycolumnfamily:my*
    • mycolumnfamily:*
  • The outputField parameter specifies the morphline record field to which to add output values. The morphline record field is also known as the Solr document field. Example: first_name.
  • Dynamic output fields are enabled by the outputField parameter ending with a wildcard (*). For example:
    inputColumn : "mycolumnfamily:*"
    outputField : "belongs_to_*"
    In this case, if you make these puts in HBase:
    put 'table_name' , 'row1' , 'mycolumnfamily:1' , 'foo'
    put 'table_name' , 'row1' , 'mycolumnfamily:9' , 'bar'
    Then the fields of the Solr document are as follows:
    belongs_to_1 : foo
    belongs_to_9 : bar
  • The type parameter defines the data type of the content in HBase. All input data is stored in HBase as byte arrays, but all content in Solr is indexed as text, so a method for converting byte arrays to the actual data type is required. The type parameter can be the name of a type that is supported by org.apache.hadoop.hbase.util.Bytes.to* (which currently includes byte[], int, long, string, boolean, float, double, short, and bigdecimal). Use type byte[] to pass the byte array through to the morphline without conversion.
    • type:byte[] copies the byte array unmodified into the record output field
    • type:int converts with org.apache.hadoop.hbase.util.Bytes.toInt
    • type:long converts with org.apache.hadoop.hbase.util.Bytes.toLong
    • type:string converts with org.apache.hadoop.hbase.util.Bytes.toString
    • type:boolean converts with org.apache.hadoop.hbase.util.Bytes.toBoolean
    • type:float converts with org.apache.hadoop.hbase.util.Bytes.toFloat
    • type:double converts with org.apache.hadoop.hbase.util.Bytes.toDouble
    • type:short converts with org.apache.hadoop.hbase.util.Bytes.toShort
    • type:bigdecimal converts with org.apache.hadoop.hbase.util.Bytes.toBigDecimal
    Alternatively, the type parameter can be the name of a Java class that implements the com.ngdata.hbaseindexer.parse.ByteArrayValueMapper interface.
    HBase data formatting does not always match what is specified by org.apache.hadoop.hbase.util.Bytes.*. For example, this can occur with data of type float or double. You can enable indexing of such HBase data by converting the data. There are various ways to do so, including:
    • Using Java morphline command to parse input data, converting it to the expected output. For example:
      {
       imports : "import java.util.*;" code: """ // manipulate the contents of a record field
       String stringAmount = (String) record.getFirstValue("amount");
       Double dbl = Double.parseDouble(stringAmount); record.replaceValues("amount",dbl);
       return child.process(record); // pass record to next command in chain """
      }
    • Creating table fields with binary format and then using types such as double or float in a morphline.conf. You could create a table in HBase for storing doubles using commands similar to:
      CREATE TABLE sample_lily_hbase ( id string, amount double, ts timestamp )
      STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
      WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,ti:amount#b,ti:ts,')
      TBLPROPERTIES ('hbase.table.name' = 'sample_lily'); 
  • The source parameter determines which portion of an HBase KeyValue is used as indexing input. Valid choices are value or qualifier. When value is specified, the HBase cell value is used as input for indexing. When qualifier is specified, then the HBase column qualifier is used as input for indexing. The default is value.

Registering a Lily HBase Indexer Configuration with the Lily HBase Indexer Service

When the content of the Lily HBase Indexer configuration XML file is satisfactory, register it with the Lily HBase Indexer Service. Register the Lily HBase Indexer configuration file by uploading the Lily HBase Indexer configuration XML file to ZooKeeper. For example:

  1. If your cluster has security enabled, create a Java Authentication and Authorization Service (JAAS) configuration file named jaas.conf in your home directory with the following contents:
    Client {
      com.sun.security.auth.module.Krb5LoginModule required
      useKeyTab=false
      useTicketCache=true
        principal="jdoe@EXAMPLE.COM";
    };

    Replace jdoe@EXAMPLE.COM with your user principal. Your user account must have WRITE permission to create an indexer. For more information, see Configuring the Lily HBase Indexer Service to Use the Sentry Service.

  2. If your cluster has security enabled, authenticate with the user principal specified in your jaas.conf file:
    kinit jdoe@EXAMPLE.COM
  3. Run the following command to register your indexer configuration file with the indexer service. If you have enabled Sentry authorization, add the --http http://lily01.example.com:11060/indexer/ parameter, replacing lily01.example.com with your Lily HBase Indexer hostname:
hbase-indexer add-indexer \
--name myIndexer \
--indexer-conf $HOME/morphline-hbase-mapper.xml \
--connection-param solr.zk=zk01.example.com,zk02.example.com,zk03.example.com/solr \
--connection-param solr.collection=hbase_collection \
--zookeeper zk01.example.com:2181,zk02.example.com:2181,zk03.example.com:2181

Verify that the indexer was successfully created as follows. If you have enabled Sentry authorization, add the --http http://lily01.example.com:11060/indexer/ parameter, replacing lily01.example.com with your Lily HBase Indexer hostname:

hbase-indexer list-indexers -zookeeper zk01.example.com:2181,zk02.example.com:2181,zk03.example.com:2181
Number of indexes: 1

myIndexer
  + Lifecycle state: ACTIVE
  + Incremental indexing state: SUBSCRIBE_AND_CONSUME
  + Batch indexing state: INACTIVE
  + SEP subscription ID: Indexer_myIndexer
  + SEP subscription timestamp: 2013-06-12T11:23:35.635-07:00
  + Connection type: solr
  + Connection params:
    + solr.collection = hbase-collection1
    + solr.zk = localhost/solr
  + Indexer config:
      110 bytes, use -dump to see content
  + Batch index config:
      (none)
  + Default batch index config:
      (none)
  + Processes
    + 1 running processes
    + 0 failed processes

Use the update-indexer and delete-indexer command-line options of the hbase-indexer utility to manipulate existing Lily HBase Indexers. If you have enabled Sentry authorization, you must include the --http http://lily01.example.com:11060/indexer/ parameter in all commands.

For more help, use the following commands:

hbase-indexer add-indexer --help
hbase-indexer list-indexers --help
hbase-indexer update-indexer --help
hbase-indexer delete-indexer --help

The morphlines.conf configuration file must be present on every host that runs an indexer. For Cloudera Manager environments, this is handled automatically.

Morphline configuration files can be changed without re-creating the indexer itself, but you must restart the Lily HBase Indexer service for the changes to take effect.

Verifying that Indexing Works

Add rows to the indexed HBase table. For example:

hbase shell
hbase(main):001:0> put 'sample_table', 'row1', 'data', 'value'
hbase(main):002:0> put 'sample_table', 'row2', 'data', 'value2'

If the put operation succeeds, wait a few seconds, go to the SolrCloud UI query page, and query the data. Note the updated rows in Solr.

To print diagnostic information, such as the content of records as they pass through the morphline commands, enable the TRACE log level by adding the following to your log4j.properties file:

log4j.logger.org.kitesdk.morphline=TRACE
log4j.logger.com.ngdata=TRACE

For Cloudera Manager environments, the logging configuration is modified as follows:

  1. Go to Key-Value Store Indexer service > Configuration > Category > Advanced.
  2. Find the Lily HBase Indexer Logging Advanced Configuration Snippet (Safety Valve) property or search for it by typing its name in the Search box.
  3. Click Save Changes.
  4. Restart the service (Key-Value Store Indexer service > Actions > Restart).

Examine the log files in /var/log/hbase-solr/lily-hbase-indexer-* for details.

Using the Indexer HTTP Interface

Beginning with CDH 5.4 the Lily HBase Indexer includes an HTTP interface for the list-indexers, create-indexer, update-indexer, and delete-indexer commands. This interface can be secured with Kerberos for authentication and Apache Sentry for authorization. For information on configuring security for the Lily HBase Indexer service, see Configuring Lily HBase Indexer Security.

By default, the hbase-indexercommand line client does not use the HTTP interface. Use the HTTP interface to take advantage of the features it provides, such as Kerberos authentication and Sentry integration. The hbase-indexer command supports two additional parameters to the list-indexers, create-indexer, delete-indexer, and update-indexer commands:

  • --http: An HTTP URI for the HTTP interface. By default, this URI is of the form http://lily01.example.com:11060/indexer/. If this parameter is specified, the Lily HBase Indexer uses the HTTP API. If this parameter is not specified, the indexer communicates directly with ZooKeeper.
  • --jaas: Specifies a Java Authentication and Authorization Service (JAAS) configuration file. This is only necessary for Kerberos-enabled deployments.

For example:

hbase-indexer list-indexers --http http://lily01.example.com:11060/indexer/ \
--jaas $HOME/jaas.conf --zookeeper zk01.example.com:2181,zk02.example.com:2181,zk03.example.com:2181

Configuring Lily HBase Indexer Security

The Lily HBase indexer supports Kerberos for authentication, and Apache Sentry for authorization. For more information, see Configuring Lily HBase Indexer Security.