Load and Index Data in Search

Run the Quick Start script from a subdirectory of one of the following locations. Because the script's path typically includes the product version (for example, Cloudera Manager 5.4.x), the exact path varies between installations; use wildcards to account for this.
  • Packages: /usr/share/doc. If Search for CDH 5.4.10 is installed to the default location using packages, the Quick Start script is found in /usr/share/doc/search-*/quickstart.
  • Parcels: /opt/cloudera/parcels/CDH/share/doc. If Search for CDH 5.4.10 is installed to the default location using parcels, the Quick Start script is found in /opt/cloudera/parcels/CDH/share/doc/search-*/quickstart.
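For example, a small POSIX-shell helper can check both default install roots; the helper name and the not-found message here are illustrative, not part of the product:

```shell
#!/bin/sh
# Look for quickstart.sh under each candidate directory; the version-specific
# part of the path is covered by the shell glob the caller passes in.
find_quickstart() {
  for dir in "$@"; do
    if [ -x "$dir/quickstart.sh" ]; then
      echo "$dir/quickstart.sh"
      return 0
    fi
  done
  return 1
}

# Check the package and parcel install roots listed above.
find_quickstart /usr/share/doc/search-*/quickstart \
                /opt/cloudera/parcels/CDH/share/doc/search-*/quickstart \
  || echo "quickstart.sh not found in the default locations" >&2
```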

The script uses several defaults that you might want to modify:

Script Parameters and Defaults

| Parameter | Default | Notes |
|---|---|---|
| NAMENODE_CONNECT | `hostname`:8020 | For use on an HDFS HA cluster. If you use NAMENODE_CONNECT, do not use NAMENODE_HOST or NAMENODE_PORT. |
| NAMENODE_HOST | `hostname` | If you use NAMENODE_HOST and NAMENODE_PORT, do not use NAMENODE_CONNECT. |
| NAMENODE_PORT | 8020 | If you use NAMENODE_HOST and NAMENODE_PORT, do not use NAMENODE_CONNECT. |
| ZOOKEEPER_ENSEMBLE | `hostname`:2181/solr | ZooKeeper ensemble to point to, for example zk1,zk2,zk3:2181/solr. If you use ZOOKEEPER_ENSEMBLE, do not use ZOOKEEPER_HOST, ZOOKEEPER_PORT, or ZOOKEEPER_ROOT. |
| ZOOKEEPER_HOST | `hostname` | |
| ZOOKEEPER_PORT | 2181 | |
| ZOOKEEPER_ROOT | /solr | |
| HDFS_USER | ${HDFS_USER:="${USER}"} | |
| SOLR_HOME | /opt/cloudera/parcels/SOLR/lib/solr | |
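These defaults are ordinary shell parameter expansions of the form ${VAR:-default} (or ${VAR:=default}, as shown for HDFS_USER), so any of them can be overridden simply by setting the variable in the environment before the script runs. A minimal illustration of the pattern:

```shell
#!/bin/sh
# Same defaulting idiom the quickstart script relies on: take the value
# from the environment if it is set, otherwise fall back to the default.
NAMENODE_PORT=${NAMENODE_PORT:-8020}
ZOOKEEPER_ROOT=${ZOOKEEPER_ROOT:-/solr}
HDFS_USER=${HDFS_USER:="${USER}"}

echo "port=$NAMENODE_PORT root=$ZOOKEEPER_ROOT user=$HDFS_USER"
```

Exporting, say, NAMENODE_PORT=9000 before running this snippet changes the printed port accordingly; the same mechanism applies to every parameter in the table.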

By default, the script is configured to run on the NameNode host, which is also running ZooKeeper. Override these defaults with custom values when you start quickstart.sh. For example, to use an alternate NameNode and HDFS user ID, you could start the script as follows:

$ NAMENODE_HOST=nnhost HDFS_USER=jsmith ./quickstart.sh

The first time the script runs, it downloads required files such as the Enron data and configuration files. Subsequent runs reuse the already-downloaded Enron data rather than fetching it again, and re-create the enron-email-collection SolrCloud collection from that existing data.

The script also generates a Solr configuration and creates a collection in SolrCloud. The following steps describe what the script does; you can also complete them manually, if desired:

  1. Set variables such as hostnames and directories.
  2. Create a directory to which to copy the Enron data and then copy that data to this location. This data is about 422 MB and in some tests took about five minutes to download and two minutes to untar.
  3. Create directories for the current user in HDFS, change ownership of that directory to the current user, create a directory for the Enron data, and load the Enron data to that directory. In some tests, it took about a minute to copy approximately 3 GB of untarred data.
  4. Use solrctl to generate a template instance directory.
  5. Use solrctl to create a new Solr collection for the Enron email data.
  6. Create a directory to which the MapReduceIndexerTool can write its results, and ensure that the directory is empty.
  7. Use the MapReduceIndexerTool to index the Enron data and push the result live to enron-email-collection. In some tests, this task took about seven minutes to complete.
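Steps 4 through 7 can be sketched as shell commands like the following. The collection name matches the script's, but the instance directory, hostnames, shard count, morphline file, and jar path are all assumptions that will differ between clusters; the run helper only prints each command (a dry run), so remove it to execute the commands for real:

```shell
#!/bin/sh
# Dry-run sketch of the manual steps; paths, hostnames, shard count, and
# the job jar location are assumptions, not product defaults.
run() { echo "+ $*"; }   # print instead of executing; drop this to run for real

COLLECTION=enron-email-collection
INSTANCE_DIR=$HOME/emails-config          # local template location (assumption)

# Step 4: generate a template instance directory and upload it to ZooKeeper.
run solrctl instancedir --generate "$INSTANCE_DIR"
run solrctl instancedir --create "$COLLECTION" "$INSTANCE_DIR"

# Step 5: create the SolrCloud collection (2 shards is an assumption).
run solrctl collection --create "$COLLECTION" -s 2

# Step 6: make sure the indexer's output directory exists and is empty.
run hadoop fs -rm -r -skipTrash "/user/$USER/outdir"
run hadoop fs -mkdir -p "/user/$USER/outdir"

# Step 7: index the Enron data and merge the result into the live collection.
run hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-job.jar \
  org.apache.solr.hadoop.MapReduceIndexerTool \
  --morphline-file morphlines.conf \
  --output-dir "hdfs://nnhost:8020/user/$USER/outdir" \
  --go-live \
  --zk-host zkhost:2181/solr \
  --collection "$COLLECTION" \
  "hdfs://nnhost:8020/user/$USER/enron-input"
```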