This is the documentation for CDH 5.0.x. Documentation for other versions is available at Cloudera Documentation.

Cloudera Search Components

Search interacts with existing CDH components, using many of them to solve different problems. The following table lists the CDH components that contribute to the Search process and describes how each component helps:

Component

Contribution

Applicable To

HDFS

Source documents are typically stored in HDFS. These documents are indexed and made searchable. The files that support Search, such as Lucene index files and write-ahead logs, are also stored in HDFS. Using HDFS provides simpler provisioning on a larger base, redundancy, and fault tolerance. Because of HDFS, Search servers are essentially stateless, so node failures have minimal consequences. HDFS also provides additional benefits such as snapshotting, inter-cluster replication, and disaster recovery.

All cases

MapReduce

Search includes a pre-built MapReduce-based job. This job can be used for on-demand or scheduled indexing of any supported data set stored in HDFS. This job utilizes cluster resources for scalable batch indexing.

Many cases

Flume

Cloudera Search includes a Flume sink that writes events directly to indexers deployed on the cluster, so data can be indexed during ingestion.

Many cases

Hue

Cloudera Search includes a Hue front-end search application that uses standard Solr APIs. The application can interact with data indexed in HDFS. It supports the Solr standard query language and visualization of faceted search functionality, and it provides a typical GUI-based full-text search interface.

Many cases

ZooKeeper

Coordinates the distribution of data and metadata across index partitions, also known as shards. ZooKeeper provides automatic failover, increasing service resiliency.

Many cases

HBase

Supports indexing of stored data, extracting columns, column families, and key information as fields. HBase does not use secondary indexing, so Search complements it by providing full-text search of content in HBase rows and tables.

Some cases

Cloudera Manager

Deploys, configures, manages, and monitors the search processes and resource utilization across services on the cluster. Search does not require Cloudera Manager, but Cloudera Manager helps simplify Search administration.

Some cases

Oozie

Automates scheduling and management of indexing jobs. Oozie can check for new data and begin indexing jobs, as required.

Some cases

Impala

Further analyzes search results.

Some cases

Hive

Further analyzes search results.

Some cases

Avro

Includes metadata that Search can use for indexing.

Some cases

Sqoop

Ingests data in batch and enables data availability for batch indexing.

Some cases

Mahout

Applies machine learning processing to search results.

Some cases
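
The Hue entry in the table above notes that the search application uses standard Solr APIs, including the Solr standard query language and faceted search. As a rough illustration only, the following Python sketch builds a request URL for Solr's standard select handler using the common Solr parameters q, rows, wt, facet, and facet.field. The host name, port, collection name, and field names are hypothetical placeholders and are not taken from this documentation. The sketch only constructs the URL; it does not contact a server.

```python
from urllib.parse import urlencode


def build_solr_query_url(base_url, collection, query, facet_fields=(), rows=10):
    """Build a URL for a Solr select request with optional field faceting.

    base_url, collection, and the field names are illustrative assumptions;
    substitute the values used by your own deployment.
    """
    params = [
        ("q", query),        # query in the Solr standard query language
        ("rows", str(rows)), # maximum number of documents to return
        ("wt", "json"),      # ask for a JSON response
    ]
    if facet_fields:
        # Turn on faceting and request counts for each listed field.
        params.append(("facet", "true"))
        params.extend(("facet.field", field) for field in facet_fields)
    return f"{base_url.rstrip('/')}/{collection}/select?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical host, collection, and fields, for illustration only.
    url = build_solr_query_url(
        "http://search-host.example.com:8983/solr",
        "collection1",
        "text:hadoop",
        facet_fields=["source", "language"],
    )
    print(url)
```

The parameters are kept as a list of pairs rather than a dictionary because facet.field can legitimately appear more than once in a single request.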

Page generated September 3, 2015.