This is the documentation for CDH 5.1.x. Documentation for other versions is available at Cloudera Documentation.

Cloudera Search Components

Search interacts with existing CDH components, using many of them to solve different problems. The following table lists the CDH components that contribute to the Search process and describes how each component helps:

Component	Contribution	Applicable To
HDFS	Source documents are typically stored in HDFS. These documents are indexed and made searchable. The files that support Search such as Lucene index files and write-ahead logs are also stored in HDFS. Using HDFS provides simpler provisioning on a larger base, redundancy, and fault tolerance. As a result of using HDFS, Search servers are essentially stateless, meaning there are minimal consequences from node failures. HDFS also provides additional benefits such as snapshotting, inter-cluster replication, and disaster recovery.	All cases
MapReduce	Search includes a pre-built MapReduce-based job. This job can be used for on-demand or scheduled indexing of any supported data set stored in HDFS. This job utilizes cluster resources for scalable batch indexing.	Many cases
Flume	Cloudera Search includes a Flume sink that writes events directly to indexers deployed on the cluster, so that data is indexed during ingestion.	Many cases
Hue	Cloudera Search includes a Hue frontend search application that uses standard Solr APIs. The application can interact with data indexed in HDFS. It supports the Solr standard query language, visualizes faceted search functionality, and provides a typical GUI-based full-text search.	Many cases
ZooKeeper	Coordinates distribution of data and metadata among partitions of the index, also known as shards. ZooKeeper provides automatic failover, increasing service resiliency.	Many cases
HBase	Supports indexing of stored data, extracting columns, column families, and key information as fields. Because HBase does not provide secondary indexing, Search can fill that gap by running full-text searches over the content of rows and tables in HBase.	Some cases
Cloudera Manager	Deploys, configures, manages, and monitors the search processes and resource utilization across services on the cluster. Search does not require Cloudera Manager, but Cloudera Manager helps simplify Search administration.	Some cases
Oozie	Automates scheduling and management of indexing jobs. Oozie can check for new data and begin indexing jobs, as required.	Some cases
Impala	Further analyzes search results.	Some cases
Hive	Further analyzes search results.	Some cases
Avro	Includes metadata that Search can use for indexing.	Some cases
Sqoop	Ingests data in batch and enables data availability for batch indexing.	Some cases
Mahout	Applies machine learning processing to search results.	Some cases
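
The Hue row above notes that the search application uses standard Solr APIs, including the Solr standard query language and faceted search. As a rough illustration of what such a request can look like, the following Python sketch builds a URL for Solr's `/select` request handler using the common `q`, `rows`, `wt`, `facet`, and `facet.field` parameters. The host name, port, collection name (`logs`), and field names (`message`, `hostname`) are hypothetical placeholders, not values from this documentation, and the helper function is an illustrative assumption rather than part of Cloudera Search.

```python
from urllib.parse import urlencode


def build_solr_select_url(base_url, collection, query, facet_fields=(), rows=10):
    """Build a URL for Solr's standard /select request handler.

    base_url, collection, and the field names passed in are assumed
    placeholders; substitute the values used by your own deployment.
    """
    params = [
        ("q", query),        # query in the Solr standard query language
        ("rows", str(rows)),  # maximum number of documents to return
        ("wt", "json"),      # ask for a JSON response
    ]
    if facet_fields:
        # Enable faceting and request counts for each listed field.
        params.append(("facet", "true"))
        params.extend(("facet.field", field) for field in facet_fields)
    return f"{base_url.rstrip('/')}/{collection}/select?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical host and collection, for illustration only.
    print(
        build_solr_select_url(
            "http://search-host.example.com:8983/solr",
            "logs",
            "message:error",
            facet_fields=["hostname"],
        )
    )
```

The sketch only constructs the request URL. Sending it requires a reachable Search server and, on secured clusters, whatever authentication that deployment uses.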

Page generated September 3, 2015.