Load and Index Data in Search
- Packages: /usr/share/doc. If Search for CDH 5.1.0 is installed to the default location using packages, the Quick Start script is found in /usr/share/doc/search-1.0.0+cdh5.1.0+0/quickstart.
- Parcels: /opt/cloudera/parcels/CDH/share/doc. If Search for CDH 5.1.0 is installed to the default location using parcels, the Quick Start script is found in /opt/cloudera/parcels/CDH/share/doc/search-1.0.0+cdh5.1.0+0/quickstart.
The script uses several defaults. The defaults you are most likely to modify include:
Parameter | Default | Notes |
---|---|---|
NAMENODE_CONNECT | `hostname`:8020 | For use on an HDFS HA cluster. If you use NAMENODE_CONNECT, do not use NAMENODE_HOST or NAMENODE_PORT. |
NAMENODE_HOST | `hostname` | If you use NAMENODE_HOST and NAMENODE_PORT, then do not use NAMENODE_CONNECT. |
NAMENODE_PORT | 8020 | If you use NAMENODE_HOST and NAMENODE_PORT, then do not use NAMENODE_CONNECT. |
ZOOKEEPER_HOST | `hostname` | |
ZOOKEEPER_PORT | 2181 | |
ZOOKEEPER_ROOT | /solr | |
HDFS_USER | ${HDFS_USER:="${USER}"} | |
SOLR_HOME | /opt/cloudera/parcels/SOLR/lib/solr | |
By default, the script assumes it is running on the NameNode host, which is also running ZooKeeper. Override these defaults with custom values when you start quickstart.sh. For example, to use an alternate NameNode host and HDFS user ID, you could start the script as follows:
```bash
$ NAMENODE_HOST=nnhost HDFS_USER=jsmith ./quickstart.sh
```
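Similarly, to run the script against an HDFS HA cluster, you would set NAMENODE_CONNECT (and leave NAMENODE_HOST and NAMENODE_PORT unset), along with any other defaults you need to change. The following is a sketch only; the host names nameservice1 and zk1.example.com are placeholders, not values defined by the script:

```bash
# Placeholder host names; substitute your cluster's values.
# On an HDFS HA cluster, set NAMENODE_CONNECT and do not set NAMENODE_HOST or NAMENODE_PORT.
$ NAMENODE_CONNECT=nameservice1:8020 ZOOKEEPER_HOST=zk1.example.com HDFS_USER=jsmith ./quickstart.sh
```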
Further discussion of the script
The first time the script runs, it downloads required files such as the Enron data and configuration files. On subsequent runs, the script reuses the Enron data it has already downloaded instead of downloading it again, and uses that existing data to re-create the enron-email-collection SolrCloud collection.
The script completes the following tasks (a sketch of roughly equivalent manual commands follows the list):
- Set variables such as host names and directories.
- Create a directory to which to copy the Enron data, and then copy that data to this location. This data is about 422 MB; in some tests, it took around five minutes to download and two minutes to untar.
- Create directories for the current user in HDFS, change ownership of those directories to the current user, create a directory for the Enron data, and load the Enron data into that directory. In some tests, it took around a minute to copy the approximately 3 GB of untarred data.
- Use solrctl to create a template of the instance directory.
- Use solrctl to create a new Solr collection for the Enron mail collection.
- Create a directory to which the MapReduceIndexerTool can write results, and ensure that the directory is empty.
- Use the MapReduceIndexerTool to index the Enron data and push the result live to enron-email-collection. In some tests, it took around seven minutes to complete this task.
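For readers who want to adapt these steps outside the script, the sketch below shows roughly equivalent manual commands. It is not a copy of quickstart.sh: the HDFS paths, shard count, morphline file name, ZooKeeper address, and search-mr job JAR location are assumptions for illustration only.

```bash
# A hedged sketch of the script's main steps; paths and names are illustrative, not taken from quickstart.sh.

# Stage the Enron data in HDFS (directory names are assumptions).
hadoop fs -mkdir -p /user/$USER/enron/indir
hadoop fs -copyFromLocal enron_mail/* /user/$USER/enron/indir

# Generate a local template instance directory and upload it to ZooKeeper with solrctl.
solrctl instancedir --generate $HOME/enron_collection_config
solrctl instancedir --create enron-email-collection $HOME/enron_collection_config

# Create the SolrCloud collection (the shard count here is an assumption).
solrctl collection --create enron-email-collection -s 2

# Index the data with MapReduceIndexerTool and push ("go live") the result to the collection.
# The JAR path, morphline file, output directory, and ZooKeeper address are assumptions.
hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar \
  org.apache.solr.hadoop.MapReduceIndexerTool \
  --morphline-file morphlines.conf \
  --output-dir hdfs://nnhost:8020/user/$USER/enron/outdir \
  --zk-host zkhost:2181/solr \
  --collection enron-email-collection \
  --go-live \
  hdfs://nnhost:8020/user/$USER/enron/indir
```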