Batch Indexing into Online Solr Servers Using GoLive Feature
MapReduceIndexerTool is a MapReduce batch job driver that creates a set of Solr index shards from a set of input files and writes the indexes into HDFS in a flexible, scalable, and fault-tolerant manner. Using the GoLive feature, MapReduceIndexerTool also supports merging the output shards into a set of live customer-facing Solr servers, typically a SolrCloud.
- Delete all existing documents in Solr.
$ solrctl collection --deletedocs collection3
- Run the MapReduce job using the GoLive option. Be sure to replace $NNHOST and $ZKHOST in the command with your NameNode and ZooKeeper hostnames and port numbers, as required. Note that you do not need to specify --solr-home-dir because the job accesses it from ZooKeeper.
$ hadoop --config /etc/hadoop/conf.cloudera.mapreduce1 jar \
  /usr/lib/solr/contrib/mr/search-mr-*-job.jar \
  org.apache.solr.hadoop.MapReduceIndexerTool \
  -D 'mapred.child.java.opts=-Xmx500m' \
  --log4j /usr/share/doc/search*/examples/solr-nrt/log4j.properties \
  --morphline-file /usr/share/doc/search*/examples/solr-nrt/test-morphlines/tutorialReadAvroContainer.conf \
  --output-dir hdfs://$NNHOST:8020/user/$USER/outdir \
  --verbose \
  --go-live \
  --zk-host $ZKHOST:2181/solr \
  --collection collection3 \
  hdfs://$NNHOST:8020/user/$USER/indir
- Check the job tracker status at http://localhost:50030/jobtracker.jsp.
- Once the job completes, try some Solr queries. For example, for myserver.example.com, use: http://myserver.example.com:8983/solr/collection3/select?q=*%3A*&wt=json&indent=true

For command line help on how to run a Hadoop MapReduce job, use the following command:
$ hadoop jar /usr/lib/solr/contrib/mr/search-mr-*-job.jar \
  org.apache.solr.hadoop.MapReduceIndexerTool --help
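The verification query from the final step of the procedure can also be scripted. The following is a minimal sketch and not part of the documented procedure: the function name solr_select_url is hypothetical, and the host name, port 8983, and collection3 are taken from the example query above. The script only builds and prints the request URL; the curl call is left commented out because it needs a reachable Solr server.

```shell
#!/bin/sh
# Hypothetical helper: build the select-all query URL for a collection.
# $1 = Solr host name, $2 = collection name.
solr_select_url() {
  # q=*:* is URL-encoded as *%3A*; %% yields a literal % in printf.
  printf 'http://%s:8983/solr/%s/select?q=*%%3A*&wt=json&indent=true' "$1" "$2"
}

# Example values from this page; replace with your own server and collection.
SOLR_HOST="myserver.example.com"
COLLECTION="collection3"

# Print the URL so that it can be inspected before use.
echo "$(solr_select_url "$SOLR_HOST" "$COLLECTION")"

# Uncomment to send the query to a live server (assumes curl is installed):
# curl -s "$(solr_select_url "$SOLR_HOST" "$COLLECTION")"
```

Printing the URL first makes it easy to confirm the host and collection before querying the live servers that the GoLive merge has just updated.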
Note: For development purposes, use the MapReduceIndexerTool --dry-run option to run in local mode and print documents to stdout, instead of loading them to Solr. Using this option causes the morphline to execute in the client process without submitting a job to MapReduce. Executing in the client process provides quicker turnaround during early trial and debug sessions.

Note: To print diagnostic information, such as the content of records as they pass through the morphline commands, consider enabling TRACE log level. You can enable TRACE log level diagnostics by adding the following entry to your log4j.properties file:

log4j.logger.org.kitesdk.morphline=TRACE
The log4j.properties file can be passed via the MapReduceIndexerTool --log4j command line option.
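Adding the TRACE entry can be done by hand or with a small script. The sketch below is one possible approach, not a documented tool: the function name enable_morphline_trace is hypothetical, and it assumes you are working on your own writable copy of log4j.properties that ends with a newline. It appends the entry only if an identical line is not already present, so running it twice does not create duplicates.

```shell
#!/bin/sh
# Hypothetical helper: append the morphline TRACE entry to a log4j.properties
# file unless the exact line is already there.
# $1 = path to a writable log4j.properties file (assumed to end with a newline).
enable_morphline_trace() {
  entry='log4j.logger.org.kitesdk.morphline=TRACE'
  # -x matches whole lines only, -F treats the entry as a fixed string.
  grep -qxF "$entry" "$1" 2>/dev/null || printf '%s\n' "$entry" >> "$1"
}

# Example usage (the file name is an assumption; use your own copy):
# enable_morphline_trace ./log4j.properties
```

The edited file is then the one to pass with the --log4j option.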