
Batch Indexing into Online Solr Servers Using the GoLive Feature

MapReduceIndexerTool is a MapReduce batch job driver that creates a set of Solr index shards from a set of input files and writes the indexes to HDFS in a flexible, scalable, and fault-tolerant manner. With the GoLive feature, MapReduceIndexerTool can also merge the output shards into a set of live, customer-facing Solr servers, typically a SolrCloud. The following steps batch index a set of input files and merge the resulting shards into the live collection3 collection:

  1. Delete all existing documents in Solr.
    $ solrctl collection --deletedocs collection3
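    To confirm that the collection is now empty, you can query it and check that numFound is 0. The following is only a sketch; it assumes a Solr server listening on port 8983 of the local host:
    $ curl 'http://localhost:8983/solr/collection3/select?q=*%3A*&rows=0&wt=json'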
  2. Run the MapReduce job with the --go-live option. Be sure to replace $NNHOST and $ZKHOST in the command with your NameNode and ZooKeeper hostnames and port numbers, as required. You do not need to specify --solr-home-dir because the job retrieves the Solr configuration from ZooKeeper.
    $ hadoop --config /etc/hadoop/conf.cloudera.mapreduce1 jar \
    /usr/lib/solr/contrib/mr/search-mr-*-job.jar \
    org.apache.solr.hadoop.MapReduceIndexerTool -D \
    'mapred.child.java.opts=-Xmx500m' --log4j \
    /usr/share/doc/search*/examples/solr-nrt/log4j.properties --morphline-file \
    /usr/share/doc/search*/examples/solr-nrt/test-morphlines/tutorialReadAvroContainer.conf \
    --output-dir hdfs://$NNHOST:8020/user/$USER/outdir --verbose --go-live \
    --zk-host $ZKHOST:2181/solr --collection collection3 \
    hdfs://$NNHOST:8020/user/$USER/indir
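    If a previous run already wrote index data to the output directory, the job may fail rather than overwrite it. The following sketch, which assumes the same $NNHOST and $USER values, clears the directory before rerunning:
    $ hadoop fs -rm -r -skipTrash hdfs://$NNHOST:8020/user/$USER/outdir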
  3. Check the JobTracker status. For example, for a JobTracker running on the local host, go to http://localhost:50030/jobtracker.jsp.
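    If you prefer the command line, you can also list running jobs. This is a sketch that assumes the same MRv1 client configuration used in step 2:
    $ hadoop --config /etc/hadoop/conf.cloudera.mapreduce1 job -list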
  4. When the job completes, run some Solr queries to verify that the documents were indexed. For example, for a Solr server running on myserver.example.com, use: http://myserver.example.com:8983/solr/collection3/select?q=*%3A*&wt=json&indent=true
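    You can run the same query from the command line with curl; the wt=json and indent=true parameters return the results as indented JSON:
    $ curl 'http://myserver.example.com:8983/solr/collection3/select?q=*%3A*&wt=json&indent=true'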
    For command line help on how to run a Hadoop MapReduce job, use the following command:
    $ hadoop jar /usr/lib/solr/contrib/mr/search-mr-*-job.jar \
    org.apache.solr.hadoop.MapReduceIndexerTool --help
      Note: For development purposes, use the MapReduceIndexerTool --dry-run option to run in local mode and print documents to stdout, instead of loading them to Solr. Using this option causes the morphline to execute in the client process without submitting a job to MapReduce. Executing in the client process provides quicker turnaround during early trial and debug sessions.
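      For example, you can take the command from step 2, remove the --go-live flag, and add --dry-run. This is only a sketch; all other options are unchanged:
    $ hadoop --config /etc/hadoop/conf.cloudera.mapreduce1 jar \
    /usr/lib/solr/contrib/mr/search-mr-*-job.jar \
    org.apache.solr.hadoop.MapReduceIndexerTool -D \
    'mapred.child.java.opts=-Xmx500m' --log4j \
    /usr/share/doc/search*/examples/solr-nrt/log4j.properties --morphline-file \
    /usr/share/doc/search*/examples/solr-nrt/test-morphlines/tutorialReadAvroContainer.conf \
    --output-dir hdfs://$NNHOST:8020/user/$USER/outdir --verbose --dry-run \
    --zk-host $ZKHOST:2181/solr --collection collection3 \
    hdfs://$NNHOST:8020/user/$USER/indir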
      Note: To print diagnostic information, such as the content of records as they pass through the morphline commands, consider enabling TRACE log level. You can enable TRACE log level diagnostics by adding the following entry to your log4j.properties file:
    log4j.logger.org.kitesdk.morphline=TRACE
    The log4j.properties file can be passed via the MapReduceIndexerTool --log4j command line option.