Indexing
Content must be indexed before it can be searched. Indexing comprises the following steps:
- Extraction, transformation, and loading (ETL) - Use existing engines or frameworks such as Apache Tika or Cloudera Morphlines.
  - Content and metadata extraction
  - Schema mapping
- Index creation using Lucene (see the sketch after this list).
  - Index creation
  - Index serialization
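The index creation and serialization steps come down to the core Lucene write path. The following is a minimal, self-contained sketch of that path using the plain Lucene Java API; the index path and the field names (id, title, body) are illustrative assumptions, not part of any particular Cloudera Search schema.

```java
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class IndexOneDocument {
    public static void main(String[] args) throws Exception {
        // Open (or create) an index in a local directory.
        try (FSDirectory dir = FSDirectory.open(Paths.get("/tmp/example-index"));
             IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {

            // Schema mapping: place extracted content and metadata into named fields.
            Document doc = new Document();
            doc.add(new StringField("id", "doc-1", Field.Store.YES));          // exact-match key
            doc.add(new TextField("title", "Example title", Field.Store.YES)); // analyzed text
            doc.add(new TextField("body", "Extracted document body ...", Field.Store.NO));

            // Index creation and serialization: add the document, then commit,
            // which writes the index segments to the underlying directory.
            writer.addDocument(doc);
            writer.commit();
        }
    }
}
```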
Indexes are typically stored on a local file system. Lucene supports additional index writers and readers. One HDFS-based interface implemented as part of Apache Blur is integrated with Cloudera Search and has been optimized for CDP-stored indexes. All index data in Cloudera Search is stored in and served from HDFS.
You can index content in the following ways:
Batch indexing using MapReduce
To use MapReduce to index documents, run a MapReduce job on content in HDFS to produce a Lucene index. The Lucene index is written to HDFS, and this index is subsequently used by Search services to provide query results.
Batch indexing is most often used when bootstrapping a Search cluster. The Map phase of the MapReduce task parses input into indexable documents, and the Reduce phase indexes the documents produced by the Map. You can also configure a MapReduce-based indexing job to use all assigned resources on the cluster, utilizing multiple reducing steps for intermediate indexing and merging operations, and then writing the reduction to the configured set of shard sets for the service. This makes the batch indexing process as scalable as MapReduce workloads.
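To make the map/reduce split concrete, here is a simplified sketch of such a job's mapper and reducer in Hadoop's Java API. It is not the indexing tool that Cloudera Search ships; the class names, the hash-based shard routing, the single body field, and the fixed shard count are assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.file.Paths;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class BatchIndexSketch {

    static final int NUM_SHARDS = 4; // hypothetical shard count

    /** Map phase: parse each input record into an indexable document, keyed by target shard. */
    public static class ParseMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            int shard = Math.floorMod(line.toString().hashCode(), NUM_SHARDS);
            context.write(new IntWritable(shard), line);
        }
    }

    /** Reduce phase: index all documents routed to one shard and commit that shard's index. */
    public static class IndexReducer extends Reducer<IntWritable, Text, IntWritable, Text> {
        @Override
        protected void reduce(IntWritable shard, Iterable<Text> docs, Context context)
                throws IOException, InterruptedException {
            // Write to task-local storage; a real job would then move the shard index into HDFS.
            try (FSDirectory dir = FSDirectory.open(Paths.get("shard-" + shard.get()));
                 IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
                for (Text doc : docs) {
                    Document d = new Document();
                    d.add(new TextField("body", doc.toString(), Field.Store.NO));
                    writer.addDocument(d);
                }
                writer.commit();
            }
        }
    }
}
```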
Near real time (NRT) indexing using the API
Other clients can complete NRT indexing by first writing files directly to HDFS and then triggering indexing using the Solr REST API. Specifically, the API does the following:
- Extracts content from the document contained in HDFS, where the document is referenced by a URL.
- Maps the content to fields in the search schema.
- Creates or updates a Lucene index.
This is useful if you index as part of a larger workflow. For example, you could trigger indexing from an Oozie workflow.
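As a concrete example of this API-driven path, the following SolrJ sketch adds one document to a collection and commits, which creates or updates the underlying Lucene index. The Solr URL, collection name, and field names are hypothetical, and the content is assumed to have already been extracted from the HDFS file it references.

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class NrtIndexSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical Solr endpoint and collection name.
        try (SolrClient solr =
                 new HttpSolrClient.Builder("http://solr-host:8983/solr/my_collection").build()) {

            // Map the extracted content onto fields defined in the search schema.
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "hdfs://namenode:8020/data/reports/report-1.txt"); // document URL as key
            doc.addField("title", "Quarterly report");
            doc.addField("text", "Content extracted from the HDFS file ...");

            // Create or update the Lucene index behind the collection.
            solr.add(doc);
            solr.commit();
        }
    }
}
```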