This is the documentation for CDH 5.0.x. Documentation for other versions is available at Cloudera Documentation.

Understanding Cloudera Search

Cloudera Search opens CDH to full-text search and exploration of data in HDFS and Apache HBase. It is powered by Apache Solr and enriches that industry-standard open source search solution with Hadoop platform integration, enabling a new generation of Big Data search. Cloudera Search makes it especially easy to query large data sets.

Understanding How Search Fits into Cloudera Offerings

Cloudera Search is one of several tools in the broader set of solutions available for analyzing information in large data sets. With especially large data sets, it is not possible to store all the information reliably on a single machine, nor is it possible to query such massive sets of data there. CDH provides both the means to store these large data sets and the tools to query them. At present, some of the ways data can be explored include:
  • MapReduce jobs
  • Cloudera Impala queries
  • Cloudera Search queries

While CDH alone allows storage of and access to large data sets, without Cloudera Search, users must create MapReduce jobs. Writing these jobs requires technical knowledge, and each job can take minutes or more to run. These longer run times can interrupt the process of exploring data. To provide a more immediate query-and-response experience and to eliminate the need to write MapReduce applications, Cloudera offers Real-Time Query, or Impala. Impala returns results in seconds rather than minutes.

While Impala is a fast and powerful application, it uses a SQL-based query syntax, which may be challenging for users who are not familiar with SQL. Cloudera Search provides rapid results for these less technical users. Impala, Hive, and Pig also require a structure, which is applied at query time, whereas Search supports free-text search over any data or fields you have indexed.
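
To make the contrast with SQL concrete, the following minimal sketch builds the URL for a free-text query against Solr's standard `/select` request handler. The host name, port, and the collection name `logs` are hypothetical placeholders, and the `build_solr_query_url` helper is invented for illustration; it is not part of Cloudera Search. How the text is matched depends on the default field and analyzers configured for the collection.

```python
from urllib.parse import urlencode


def build_solr_query_url(base_url, collection, text, rows=10):
    """Build a URL for a free-text query against a Solr collection.

    base_url and collection are placeholders for a real deployment;
    q, rows, and wt are standard Solr query parameters.
    """
    params = {
        "q": text,     # free-text query; no table schema or SQL required
        "rows": rows,  # maximum number of matching documents to return
        "wt": "json",  # ask Solr to format the response as JSON
    }
    return f"{base_url.rstrip('/')}/{collection}/select?{urlencode(params)}"


# Hypothetical host and collection, shown only to illustrate the shape of a query.
url = build_solr_query_url(
    "http://search-host.example.com:8983/solr", "logs", "disk failure"
)
print(url)
```

The sketch only constructs the URL. Sending it with any HTTP client to a running Solr instance would return the matching documents, whereas the equivalent Impala query would need a table definition and a SQL `WHERE` clause.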

Understanding How Search Leverages Existing Infrastructure

Any data already present in a CDH deployment can be indexed and made queryable by Cloudera Search. For data that is not stored in CDH, Cloudera Search offers tools for loading data into the existing infrastructure, as well as the ability to index data as it is moved to HDFS or written to HBase.

By leveraging existing infrastructure, Cloudera Search eliminates the need to create new, redundant structures. Furthermore, Cloudera Search uses services provided by CDH and Cloudera Manager in such a way that it does not interfere with other tasks running in the same environment. This means that you get all the benefits of reusing existing infrastructure, without the costs and problems associated with running multiple services on the same set of systems.

Page generated September 3, 2015.