Advantages of defining a schema for production use

Although schemaless mode facilitates quick adoption and exploratory use, you are recommended to implement a schema for production use of Cloudera Search. This ensures that ingestion of data is consistent with expectations and query results are reliable.

Schemaless Solr is useful for getting started quickly and for understanding the underlying structure of the data you want to search. However, Schemaless Solr is not recommended for production use cases. Because the schema is automatically generated, a mistake like misspelling the name of the field alters the schema, rather than producing an error. The mistake may not be caught until much later and once caught, may require re-indexing to fix. Also, an unexpected input format may cause the type guessing to pick a field type that is incompatible with data that is subsequently ingested, preventing further ingestion until the incompatibility is manually addressed. Such a case is rare, but could occur. For example, if the first instance of a field was an integer, such as '9', but subsequent entries were text such as '10 Spring Street', the schema would make it impossible to properly ingest those subsequent entries. Therefore, Schemaless Solr may be useful for deciding on a schema during the exploratory stage of development, but Cloudera recommends defining the schema in the traditional way before moving to production.

When constructing a schema, use data types that most accurately describe the data that the fields will contain. For example:
  • Use the tdate type for dates. Do this instead of representing dates as strings.
  • Consider using the text type that applies to your language, instead of using String. For example, you might use text_en. Text types support returning results for subsets of an entry. For example, querying on "john" would find "John Smith", whereas with the string type, only exact matches are returned.
  • For IDs, use the string type.