Preparing to Install Cloudera Search

Cloudera Search provides interactive search and scalable indexing. Before you begin installing Cloudera Search:
  • Decide whether to install Cloudera Search using Cloudera Manager or using package management tools.
  • Decide on which machines to install Cloudera Search and with which other services to collocate Search.
  • Consider the sorts of tasks, workloads, and types of data you will be searching. This information can help guide your deployment process.

Choosing Where to Deploy the Cloudera Search Processes

You can collocate a Cloudera Search server (solr-server package) with a Hadoop TaskTracker (MRv1) and a DataNode. When collocating with TaskTrackers, be sure that the machine resources are not oversubscribed. Start with a small number of MapReduce slots and increase them gradually.

For instructions describing how and where to install solr-mapreduce, see Installing MapReduce Tools for use with Cloudera Search. For information about the Search package, see the Using Cloudera Search section in the Cloudera Search Tutorial.

Guidelines for Deploying Cloudera Search

Memory

CDH initially deploys Solr with a Java virtual machine (JVM) size of 1 GB. In the context of Search, 1 GB is a small value. Starting with this small value simplifies JVM deployment, but the value is insufficient for most actual use cases. Consider the following when determining an optimal JVM size for production usage:

  • The more searchable material you have, the more memory you need. All things being equal, 10 TB of searchable data requires more memory than 1 TB of searchable data.
  • What is indexed in the searchable material. Indexing all fields in a collection of logs, email messages, or Wikipedia entries requires more memory than indexing only the Date Created field.
  • The level of performance required. If the system must be stable and respond quickly, more memory may help. If slow responses are acceptable, you may be able to use less memory.

To ensure an appropriate amount of memory, consider your requirements and experiment in your environment. In general:

  • Machines with 16 GB RAM are sufficient for some smaller loads or for evaluation.
  • Machines with 32 GB RAM are sufficient for some production environments.
  • Machines with 96 GB RAM are sufficient for most situations.
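
Whichever size you choose, the heap is ultimately controlled by the standard JVM -Xms/-Xmx flags (in Cloudera Manager, through the Solr Server heap setting). As a quick sanity check that a JVM actually received the heap you intended, the following minimal Java snippet (class name illustrative) prints the effective maximum heap; run it with the same flags you plan to give Solr:

  // HeapCheck.java: print the maximum heap available to this JVM.
  // Run with explicit flags to see their effect, for example:
  //   java -Xmx16g HeapCheck
  public class HeapCheck {
      public static void main(String[] args) {
          long maxBytes = Runtime.getRuntime().maxMemory();
          System.out.printf("Effective max heap: %.1f GB%n",
                  maxBytes / (1024.0 * 1024.0 * 1024.0));
      }
  }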

Deployment Requirements

The information in this topic should be treated as guidance rather than as absolute requirements. Using a sample application to benchmark different use cases, data types, and data sizes can help you identify the most important performance factors.

To determine how best to deploy Search in your environment, define use cases. The same Solr index can have very different hardware requirements, depending on the queries performed. The hardware requirement that varies most is memory. For example, the memory required for faceting depends on the number of unique terms in the faceted field. Suppose you want to facet on a field that has ten unique values. In this case, only ten logical containers are required for counting, so no matter how many documents are in the index, the memory overhead is almost nonexistent.

Conversely, the same index could have a unique timestamp for every entry, and you might want to facet on that timestamp field. In this case, each unique timestamp requires its own logical container. With this organization, if you had a large number of documents (500 million, for example), then faceting across 10 such fields would increase the RAM requirements significantly.
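
To make the contrast concrete, the following SolrJ sketch facets on a low-cardinality field. The host, collection, and field names are illustrative, and the client class assumes the SolrJ 4.x line shipped with CDH 5:

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.HttpSolrServer;
  import org.apache.solr.client.solrj.response.FacetField;
  import org.apache.solr.client.solrj.response.QueryResponse;

  public class FieldFacetExample {
      public static void main(String[] args) throws Exception {
          // Illustrative URL and collection name.
          HttpSolrServer solr = new HttpSolrServer("http://search01:8983/solr/logs");

          SolrQuery q = new SolrQuery("*:*");
          q.setRows(0);      // only facet counts are needed, not documents
          q.setFacet(true);
          // A field with ~10 unique values needs only ~10 logical
          // containers (counters), no matter how large the index is.
          q.addFacetField("log_level");

          QueryResponse rsp = solr.query(q);
          for (FacetField.Count count : rsp.getFacetField("log_level").getValues()) {
              System.out.println(count.getName() + ": " + count.getCount());
          }
          solr.shutdown();
      }
  }

The same addFacetField call against a unique-per-document field, such as a raw timestamp, would instead need one counter per unique value, which is where the memory cost described above appears.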

For this reason, you must define use cases and characterize the data before you can estimate hardware requirements. Important parameters to consider are:

  • Number of documents. For Cloudera Search, sharding is almost always required.
  • Approximate word count for each potential field.
  • What information is stored in the Solr index and what information is used only for searching. Information stored in the index is returned with the search results.
  • Foreign language support:
    • How many different languages appear in your data?
    • What percentage of documents are in each language?
    • Is language-specific search supported? This determines whether accent folding and storing the text in a single field is sufficient.
    • What language families will be searched? For example, you could combine all Western European languages into a single field, but combining English and Chinese into a single field is not practical. Even with more similar sets of languages, using a single field for different languages can be problematic. For example, sometimes accents alter the meaning of a word, and in such a case, accent folding loses important distinctions.
  • Faceting requirements:
    • Be wary of faceting on fields that have many unique terms. For example, faceting on timestamps or free-text fields typically has a high cost. Faceting on a field with more than 10,000 unique values is typically not useful; confirm that any such faceting is truly necessary before you plan for it.
    • What types of facets are needed? You can facet on queries as well as on field values. Faceting on queries is often useful for dates; for example, “in the last day” or “in the last week” can be valuable. Using Solr Date Math to facet on a bare “NOW” is almost always inefficient. Facet-by-query is not memory-intensive, because the number of logical containers is limited by the number of queries specified, no matter how many unique values are in the underlying field. This makes it possible to facet on fields that contain information such as dates or times, while avoiding the problem described above for faceting on fields with many unique terms (see the facet-by-query sketch following this list).
  • Sorting requirements:
    • Sorting requires one integer for each document (maxDoc) for every field you sort on, which can take up significant memory: with 500 million documents, that is roughly 2 GB of heap (500 million × 4 bytes) per sort field. Additionally, sorting on strings requires storing each unique string value.
  • Paging requirements. End users rarely look beyond the first few pages of search results. For use cases that require deep paging (paging through a large number of results), using cursors can improve performance and resource utilization (see the cursor sketch following this list). For more information, see Pagination of Results on the Apache Solr wiki. Cursors are supported in CDH 5.2 and higher.
  • Is advanced search capability planned? If so, how will it be implemented? Significant design decisions depend on user requirements:
    • Can users be expected to learn the system? “Advanced” screens can intimidate e-commerce users, but they can be the most effective choice for users who are willing to learn them.
    • How long will your users wait for results? Data-mining queries result in longer wait times. You want to limit wait times, but other design requirements can affect response times.
  • How many simultaneous users must your system accommodate?
  • Update requirements. An update in Solr refers both to adding new documents and changing existing documents:
    • Loading new documents:
      • Bulk. Will the index be rebuilt from scratch in some cases, or will there only be an initial load?
      • Incremental. At what rate will new documents enter the system?
    • Updating documents. Can you characterize the expected number of modifications to existing documents?
    • How much latency is acceptable between when a document is added to Solr and when it is available in Search results?
  • Security requirements. Solr has no built-in security options, although Cloudera Search supports authentication using Kerberos and authorization using Sentry. In Solr, document-level security is usually best accomplished by indexing authorization tokens with the document. The number of authorization tokens applied to a document rarely matters; for example, thousands are reasonable, although they can be difficult to administer. The number of authorization tokens associated with a particular user should be no more than 100 in most cases. Security at this level is often enforced by appending an "fq" clause to the query, and adding thousands of tokens to an "fq" clause is expensive (see the filter-query sketch following this list).
    • A post filter, also known as a no-cache filter, can help with access schemes that cannot use an "fq" clause. Post filters are not cached and are applied only after all less-expensive filters.
    • If grouping or faceting does not need to reflect true document counts exactly, you can use some shortcuts. For example, ACL filtering is expensive in some systems, sometimes requiring database access. If completely accurate facet counts are required, you must process the complete result list so that the facets are accurate.
  • Required query rate, usually measured in queries-per-second (QPS):
    • At a minimum, deploy machines with sufficient hardware resources to provide an acceptable response rate for a single user. Some queries burden the system so heavily that performance is unacceptable for even a small number of users; in that case, resharding is necessary.
    • If QPS is only somewhat lower than required and you do not want to reshard, you can improve performance by adding replicas to each shard.
    • As the number of shards in your deployment increases, so does the likelihood that one of the shards will be unusually slow at any given moment. When that happens, the overall QPS rate falls, although slowly. This effect typically appears as the number of shards reaches the hundreds.
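
The facet-by-query approach described in the faceting requirements above can be sketched as follows. The URL, collection, field name, and ranges are illustrative; rounding with NOW/DAY keeps the facet queries stable (and cacheable) for a whole day, where a bare NOW changes on every request:

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.HttpSolrServer;
  import org.apache.solr.client.solrj.response.QueryResponse;

  public class FacetByQueryExample {
      public static void main(String[] args) throws Exception {
          HttpSolrServer solr = new HttpSolrServer("http://search01:8983/solr/logs");

          SolrQuery q = new SolrQuery("*:*");
          q.setRows(0);
          q.setFacet(true);
          // One logical container per facet.query, regardless of how
          // many unique timestamps exist in created_at.
          q.addFacetQuery("created_at:[NOW/DAY-1DAY TO NOW/DAY+1DAY]");   // last day
          q.addFacetQuery("created_at:[NOW/DAY-7DAYS TO NOW/DAY+1DAY]");  // last week

          QueryResponse rsp = solr.query(q);
          System.out.println(rsp.getFacetQuery()); // maps each query to its count
          solr.shutdown();
      }
  }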
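
For the deep-paging case, a cursor walk in SolrJ (supported in CDH 5.2 and higher) looks roughly like this. The URL and collection are illustrative, and the sketch assumes the schema's uniqueKey field is id, because a cursor requires a sort that ends on the uniqueKey:

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.HttpSolrServer;
  import org.apache.solr.client.solrj.response.QueryResponse;
  import org.apache.solr.common.SolrDocument;
  import org.apache.solr.common.params.CursorMarkParams;

  public class CursorPagingExample {
      public static void main(String[] args) throws Exception {
          HttpSolrServer solr = new HttpSolrServer("http://search01:8983/solr/logs");

          SolrQuery q = new SolrQuery("*:*");
          q.setRows(500);
          q.setSort(SolrQuery.SortClause.asc("id")); // uniqueKey tiebreaker

          String cursorMark = CursorMarkParams.CURSOR_MARK_START;
          while (true) {
              q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
              QueryResponse rsp = solr.query(q);
              for (SolrDocument doc : rsp.getResults()) {
                  // Process each document here.
              }
              String next = rsp.getNextCursorMark();
              if (cursorMark.equals(next)) {
                  break; // an unchanged cursorMark means no more results
              }
              cursorMark = next;
          }
          solr.shutdown();
      }
  }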
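
Finally, the token-based document-level security described in the security requirements above is typically applied as a filter query. The acl_tokens field and the token values here are hypothetical; in practice, the token list would come from the calling user's entitlements:

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.HttpSolrServer;
  import org.apache.solr.client.solrj.response.QueryResponse;

  public class AclFilterExample {
      public static void main(String[] args) throws Exception {
          HttpSolrServer solr = new HttpSolrServer("http://search01:8983/solr/docs");

          SolrQuery q = new SolrQuery("quarterly results");
          // acl_tokens is a hypothetical indexed field holding each
          // document's authorization tokens. Keeping the per-user
          // token list short (tens, not thousands) keeps this cheap.
          q.addFilterQuery("acl_tokens:(group_sales OR group_emea OR clearance_2)");

          QueryResponse rsp = solr.query(q);
          System.out.println("visible hits: " + rsp.getResults().getNumFound());
          solr.shutdown();
      }
  }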