General
The following are general questions about Cloudera Search and the answers to those questions.
What is Cloudera Search?
Cloudera Search is Apache Solr integrated with CDH, including Apache Lucene, Apache SolrCloud, Apache Tika, and Apache Hadoop MapReduce and HDFS. Cloudera Search also includes valuable integrations that make searching more scalable, easy to use, and optimized for both near-real-time and batch-oriented indexing. These integrations include Cloudera Morphlines, a customizable transformation chain that simplifies loading any type of data into Cloudera Search.
What is the difference between Lucene and Solr?
Lucene is a low-level search library that is accessed by a Java API. Solr is a search server that runs in a servlet container and provides structure and convenience around the underlying Lucene library.
What is Apache Tika?
The Apache Tika toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries.
How does Cloudera Search relate to web search?
Traditional web search engines crawl web pages on the Internet for content to index. Cloudera Search indexes files and data that are stored in HDFS and HBase. To make web data available through Cloudera Search, it needs to be downloaded and stored in Cloudera Enterprise.
How does Cloudera Search relate to enterprise search?
Enterprise search connects with different backends (such as RDBMS and filesystems) and indexes data in all those systems. Cloudera Search is intended as a full-text search capability for data in CDH. Cloudera Search is a tool added to the Cloudera data processing platform and does not aim to be a stand-alone search solution, but rather a user-friendly interface to explore data in Hadoop and HBase.
How does Cloudera Search relate to custom search applications?
Custom and specialized search applications are an excellent complement to the Cloudera data-processing platform. Cloudera Search is not designed to be a custom application for niche vertical markets. However, Cloudera Search does include a simple search GUI through a plug-in application for Hue. The Hue plug-in application is based on the Solr API and allows for easy exploration, along with all of the other Hadoop frontend applications in Hue.
How can I use the Search platform? Is it exclusively available through Hue?
Search supports any application that uses standard Solr APIs. You can build custom applications using the Solr API to address specific Search use-cases.
You can also use the Hue Search application.
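As a rough illustration of what using the standard Solr API can look like, the following sketch builds a query URL for Solr's `/select` request handler with the Python standard library. The host name, port, collection name, and field name are hypothetical placeholders rather than values from this document. A cluster secured with Kerberos would also need authentication, which is not shown here.

```python
from urllib.parse import urlencode


def build_select_url(base_url, collection, query, rows=10):
    """Return a URL for Solr's /select handler on the given collection."""
    # q, rows, and wt are standard Solr query parameters.
    params = {"q": query, "rows": rows, "wt": "json"}
    return f"{base_url.rstrip('/')}/{collection}/select?{urlencode(params)}"


# Hypothetical server, collection, and field names, for illustration only.
url = build_select_url(
    "http://search-host.example.com:8983/solr", "emails", "subject:invoice"
)
print(url)
```

A custom application could fetch this URL with any HTTP client and parse the JSON response. Because any Solr Server can accept client connections, the base URL can point at whichever server is convenient.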
Does Search support querying and aggregating results across Solr collections? If data is separated into different collections, does Search support querying across those collections and aggregating results?
Search does not currently support automatic querying across collections, but you can create collection aliases that include mult
Which Solr Server should I send my queries to?
Any Solr Server can accept and process client connections.
How does Search handle querying different document types? For example, how does Search evaluate a query against both text and images?
It is possible to query disparate documents, but it is often most effective and powerful to query a collection of documents that contains documents of similar types.
Before you can query a dataset, you must first make it available to Cloudera Search. This dataset is organized as a collection, and Cloudera Search indexes the documents according to a specified schema. For example, a collection of emails might use a schema that includes date sent, sender, recipient, subject, size, and body. A collection of images might use a schema that includes date created, creator, subject, size, and geolocation.
Cloudera Search supports two ways of organizing the example of the emails and images:
- Separate collections - Using one collection for emails and one collection for images is the most flexible solution. In this case, each collection can be queried and ranked using all attributes because each collection's index applies to all documents in that collection. It is possible to get very specific results and rankings for detailed queries, but only the items in the queried collection are returned in the results.
- Unified collection - A single collection for both emails and images is the most inclusive, wide-ranging solution. You can query all documents at once, but you can only search and filter common attributes such as size and date. Querying emails for geolocation values or querying images for sender values does not make sense.
Do Search security features use Kerberos?
Yes, Cloudera Search includes support for Kerberos authentication. Search continues to use simple authentication with the anonymous user as the default configuration, but Search now supports changing the authentication scheme to Kerberos. All required packages are installed during the installation or upgrade process. Additional configuration is required before Kerberos is available in your environment.
Can I restrict access to collections?
Yes, Cloudera Search supports Apache Ranger for authorization. For more information, see Using Ranger to Provide Authorization in CDP.
Do I need to configure Ranger restrictions for each access mode, such as for the admin console and for the command line?
Ranger restrictions are consistently applied regardless of the way users attempt to complete actions. For example, restricting access to data in a collection consistently restricts that access, whether queries come from the command line, from a browser, or through the admin console.
Does Search support indexing data stored in JSON files and objects?
Yes, you can use the readJson and extractJsonPaths morphline commands that are included with the CDK to access JSON data and files. For more information, see cdk-morphlines-json.
How can I set up Cloudera Search so that results include links back to the source that contains the result?
You can use stored results fields to create links back to source documents. For information on data types, including the option to set results fields as stored, see the Solr Wiki page on SchemaXml.
For example, with MapReduceIndexerTool you can take advantage of fields such as file_path. See MapReduceIndexerTool for more information. The output from the MapReduceIndexerTool includes file path information that can be used to construct links to source documents.
If you use the Hue UI, you can link to data in HDFS by inserting links of the form:
<a href="/filebrowser/download/{{file_path}}?disposition=inline">Download</a>
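If the links are generated by an application rather than by a Hue template, the same pattern can be sketched in Python. This is a minimal sketch that assumes the stored `file_path` value is substituted into the URL exactly as in the template above. The helper name `hue_download_link` and the sample path are hypothetical.

```python
from html import escape
from urllib.parse import quote


def hue_download_link(file_path, label="Download"):
    """Build an HTML anchor that follows the Hue file browser link pattern."""
    # Percent-encode unsafe characters but keep "/" path separators intact.
    encoded = quote(file_path, safe="/")
    href = f"/filebrowser/download/{encoded}?disposition=inline"
    # Escape the attribute value and the label so the markup stays well-formed.
    return f'<a href="{escape(href, quote=True)}">{escape(label)}</a>'


# Hypothetical stored file_path value, for illustration only.
link = hue_download_link("user/alice/report 1.pdf")
print(link)
```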
Why do I get an error "no field name specified in query and no default specified via 'df' param" when I query a Schemaless collection?
Schemaless collections initially have no default or df setting. As a result, simple searches that might succeed on non-Schemaless collections may fail on Schemaless collections.
When a user submits a search, it must be clear which field Cloudera Search should query. A default field, or df, is often specified in solrconfig.xml, and when this is the case, users can submit queries that do not specify fields. In such situations, Solr uses the df value.
In Schemaless mode, there is initially no df field. As a result, when query request handlers do not specify a df, errors can result. This issue can be addressed in several ways:
- Queries can specify any valid field name on which to search. In such a case, no df is required.
- Queries can specify a default field using the df parameter. In such a case, the df is specified in the query.
- You can uncomment the df section of the generated schemaless solrconfig.xml file and set the df parameter to the desired field. In such a case, all subsequent queries can use the df field in solrconfig.xml if no field or df value is specified.
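The first two options can be sketched as query strings. The example below is only an illustration: the field name `title` is hypothetical, and the helper `select_params` is not part of any Cloudera or Solr library. It shows a query that names its field explicitly and a query that relies on the `df` request parameter.

```python
from urllib.parse import urlencode


def select_params(query, default_field=None):
    """Encode Solr query parameters, adding df only when one is given."""
    params = {"q": query}
    if default_field is not None:
        # df tells Solr which field to search when the query names none.
        params["df"] = default_field
    return urlencode(params)


# Option 1: the query names a field, so no df is required.
explicit_field = select_params("title:hadoop")
# Option 2: the query names no field, so df is passed with the request.
with_df = select_params("hadoop", default_field="title")

print(explicit_field)
print(with_df)
```

A query such as `q=hadoop` with neither a field name nor a `df` value is the case that produces the error described above on a Schemaless collection.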