Flume Near Real-Time Indexing Reference
The Flume Solr Sink is a flexible, scalable, fault tolerant, transactional, near real-time (NRT) system for processing a continuous stream of records into live search indexes. Latency from the time of data arrival to the time data appears in search query results is measured in seconds and is tunable.
Data flows from sources through Flume hosts across the network to Flume Solr sinks. The sinks extract the relevant data, transform it, and load it into a set of live Solr search servers, which in turn serve queries to end users or search applications.
The ETL functionality is flexible and customizable, using chains of morphline commands that pipe records from one transformation command to another. Commands to parse and transform a set of standard data formats such as Avro, CSV, text, HTML, XML, PDF, Word, or Excel, are provided out of the box. You can add additional custom commands and parsers as morphline plug-ins for other file or data formats. Do this by implementing a simple Java interface that consumes a record such as a file in the form of an InputStream plus some headers and contextual metadata. The record consumed by the Java interface is used to generate record output. Any kind of data format can be indexed, any Solr documents for any kind of Solr schema can be generated, and any custom ETL logic can be registered and run.
Routing to multiple Solr collections improves multi-tenancy, and routing to a SolrCloud cluster improves scalability. Flume SolrSink servers can be co-located with live Solr servers serving end user queries, or deployed on separate industry-standard hardware to improve scalability and reliability. Indexing load can be spread across a large number of Flume SolrSink servers, and Flume features such as Load balancing Sink Processor can help improve scalability and achieve high availability. .
Flume indexing provides low-latency data acquisition and querying. It complements (instead of replaces) use cases based on batch analysis of HDFS data using MapReduce. In many use cases, data flows simultaneously from the producer through Flume into both Solr and HDFS using features such as the Replicating Channel Selector to replicate an incoming flow into two output flows. You can use near real-time ingestion as well as batch analysis tools.
For a more comprehensive discussion of the Flume Architecture, see Large Scale Data Ingestion using Flume.
After configuring Flume, start it as detailed in Flume Installation.
See the Cloudera Search Tutorial for exercises that show how to configure and run a Flume SolrSink to index documents.