Lily HBase Near Real Time Indexing for Cloudera Search
The Lily HBase NRT Indexer service is a flexible, scalable,
fault-tolerant, transactional, near real-time (NRT) system for processing
a continuous stream of HBase cell updates into live search indexes.
Typically it takes seconds for data ingested into HBase to appear in
search results; this duration is tunable. The Lily HBase Indexer uses
SolrCloud to index data stored in HBase. As HBase applies inserts,
updates, and deletes to HBase table cells, the indexer keeps Solr
consistent with the HBase table contents, using standard HBase
replication. The indexer supports flexible custom application-specific
rules to extract, transform, and load HBase data into Solr. Solr search
results can contain columnFamily:qualifier links back to the data stored in
HBase. This way, applications can use the search result
set to directly access matching raw HBase cells. Indexing and searching do
not affect operational stability or write throughput of HBase because the
indexing and searching processes are separate from, and asynchronous to,
HBase.
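As an illustration of how an application might follow such a link, the sketch below turns one search hit into the coordinates of an HBase cell. It is a minimal example under assumptions: the field names `id` (holding the row key) and `hbase_column` (holding the columnFamily:qualifier link) are hypothetical and depend on your Solr schema and indexer configuration, and the hit is modeled as a plain dictionary rather than any particular client's result type.

```python
from typing import NamedTuple


class CellRef(NamedTuple):
    """Coordinates of one HBase cell: row key, column family, qualifier."""
    row: str
    family: str
    qualifier: str


def cell_ref_from_hit(hit: dict, row_field: str = "id",
                      link_field: str = "hbase_column") -> CellRef:
    """Build a CellRef from one search hit.

    Both field names are hypothetical defaults; substitute the names
    that your own schema defines.
    """
    link = hit[link_field]
    # Split on the first colon only: a column family name cannot contain
    # a colon, but a qualifier may.
    family, sep, qualifier = link.partition(":")
    if not sep or not family:
        raise ValueError(f"not a columnFamily:qualifier link: {link!r}")
    return CellRef(hit[row_field], family, qualifier)
```

The resulting row, family, and qualifier can then be passed to whichever HBase client the application already uses to read the raw cell.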
To accommodate the HBase ingest load, you can run as many Lily HBase Indexer services on different hosts as required. Because the indexing work is shared by all indexers, you can scale the service by adding more indexers. The recommended number of indexers is one for each HBase RegionServer, but in a High Availability environment, five worker nodes is the minimum for acceptable performance and reliability. You can co-locate Lily HBase Indexer services with Solr servers on the same set of hosts. RegionServers can also be co-located with the Lily HBase Indexer on the same host to improve performance.
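The sizing guidance above can be expressed as a small planning helper. This is only a sketch: the function and type names are hypothetical, and it reads the High Availability guidance as a check on the worker node count rather than as a change to the number of indexers.

```python
from typing import NamedTuple

HA_MIN_WORKER_NODES = 5  # minimum stated above for High Availability


class IndexerPlan(NamedTuple):
    indexers: int            # indexer instances to run
    meets_ha_minimum: bool   # False only if HA is requested with too few nodes


def plan_indexers(region_servers: int, worker_nodes: int,
                  high_availability: bool = False) -> IndexerPlan:
    """Apply the one-indexer-per-RegionServer recommendation."""
    if region_servers < 1 or worker_nodes < 1:
        raise ValueError("need at least one RegionServer and one worker node")
    meets = (not high_availability) or worker_nodes >= HA_MIN_WORKER_NODES
    return IndexerPlan(indexers=region_servers, meets_ha_minimum=meets)
```

For example, `plan_indexers(3, 3, high_availability=True)` recommends three indexers but reports that the five-node minimum is not met.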
The Lily HBase NRT Indexer service must be deployed in an environment with a running HBase cluster, a running SolrCloud cluster (the Solr service in Cloudera Manager), and at least one ZooKeeper quorum.