Lily HBase near real time indexing using Morphlines
You can use the Lily HBase Indexer for near real time (NRT) indexing of updates to HBase tables.
The Lily HBase NRT Indexer service is a flexible, scalable, fault-tolerant, transactional,
NRT system for processing a continuous stream of HBase cell updates into live search
indexes. Typically it takes seconds for data ingested into HBase to appear in search
results; this duration is tunable. The Lily HBase Indexer uses SolrCloud to index data
stored in HBase. As HBase applies inserts, updates, and deletes to HBase table cells, the
indexer keeps Solr consistent with the HBase table contents, using standard HBase
replication. The indexer supports flexible custom application-specific rules to extract,
transform, and load HBase data into Solr. Solr search results can contain
columnFamily:qualifier links back to the data stored in HBase. This
way, applications can use the Search result set to directly access matching raw HBase
cells. Indexing and searching do not affect operational stability or write throughput of
HBase because the indexing and searching processes are separate and asynchronous to
HBase.
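As an illustration of how an application might follow a columnFamily:qualifier link from a search result back to an HBase cell, the following minimal sketch splits such a link into the row key, column family, and qualifier that an HBase read needs. It is not part of the Lily HBase Indexer. The field names id and hbase_column, the CellRef type, and the parse_cell_ref helper are hypothetical and chosen only for this example; the actual field names depend on your Solr schema and indexer configuration.

```python
from typing import NamedTuple


class CellRef(NamedTuple):
    """Coordinates of one HBase cell, as recovered from a search hit."""
    row_key: bytes
    family: bytes
    qualifier: bytes


def parse_cell_ref(row_key: str, column: str) -> CellRef:
    """Split a 'columnFamily:qualifier' string into HBase cell coordinates.

    Only the first colon separates family from qualifier, because an HBase
    qualifier may itself contain colons while a column family name may not.
    """
    family, sep, qualifier = column.partition(":")
    if not sep or not family:
        raise ValueError(f"expected 'columnFamily:qualifier', got {column!r}")
    return CellRef(
        row_key.encode("utf-8"),
        family.encode("utf-8"),
        qualifier.encode("utf-8"),
    )


# Hypothetical search hit: the field names "id" and "hbase_column" are
# assumptions made for this sketch, not names defined by the indexer.
hit = {"id": "row-0001", "hbase_column": "info:firstname"}
ref = parse_cell_ref(hit["id"], hit["hbase_column"])
print(ref)
```

The resulting row key, family, and qualifier can then be passed to whichever HBase client the application already uses to read the matching raw cell.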
To accommodate the HBase ingest load, you can run as many Lily HBase Indexer services on different hosts as required. Because the indexing work is shared by all indexers, you can scale the service by adding more indexers. The recommended number of indexers is one for each HBase RegionServer, but in a high availability environment, five worker nodes is the minimum for acceptable performance and reliability. You can co-locate Lily HBase Indexer services with Solr servers on the same set of hosts. RegionServers can also be co-located with Lily HBase Indexer services on the same host to improve performance.
The Lily HBase NRT Indexer service must be deployed in an environment with a running HBase cluster, a running SolrCloud cluster (the Solr service in Cloudera Manager), and at least one ZooKeeper quorum.