
Batch Indexing into Online Solr Servers Using the GoLive Feature

MapReduceIndexerTool is a MapReduce batch job driver that creates a set of Solr index shards from a set of input files and writes the indexes to HDFS in a flexible, scalable, and fault-tolerant manner. With the GoLive feature, MapReduceIndexerTool can also merge the output shards into a set of live, customer-facing Solr servers, typically a SolrCloud. The following steps batch index a set of input files and merge the resulting shards into the live collection3 collection:

  1. Delete all existing documents in Solr.
    $ solrctl collection --deletedocs collection3
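    To confirm that the collection is now empty, you can query it and check that numFound is 0. The following is only a sketch; it assumes a Solr server listening on port 8983 of the local host:
    $ curl 'http://localhost:8983/solr/collection3/select?q=*%3A*&rows=0&wt=json'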
  2. Run the MapReduce job with the --go-live option. Be sure to replace $NNHOST and $ZKHOST in the command with your NameNode and ZooKeeper hostnames and port numbers, as required. You do not need to specify --solr-home-dir because the job retrieves the Solr configuration from ZooKeeper.
    $ hadoop --config /etc/hadoop/conf.cloudera.mapreduce1 jar \
    /usr/lib/solr/contrib/mr/search-mr-*-job.jar \
    org.apache.solr.hadoop.MapReduceIndexerTool -D \
    'mapred.child.java.opts=-Xmx500m' --log4j \
    /usr/share/doc/search*/examples/solr-nrt/log4j.properties --morphline-file \
    /usr/share/doc/search*/examples/solr-nrt/test-morphlines/tutorialReadAvroContainer.conf \
    --output-dir hdfs://$NNHOST:8020/user/$USER/outdir --verbose --go-live \
    --zk-host $ZKHOST:2181/solr --collection collection3 \
    hdfs://$NNHOST:8020/user/$USER/indir
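    If a previous run already wrote index data to the output directory, the job may fail rather than overwrite it. The following sketch, which assumes the same $NNHOST and $USER values, clears the directory before rerunning:
    $ hadoop fs -rm -r -skipTrash hdfs://$NNHOST:8020/user/$USER/outdir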
  3. Check the JobTracker status. For example, for a JobTracker running on the local host, go to http://localhost:50030/jobtracker.jsp.
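    If you prefer the command line, you can also list running jobs. This is a sketch that assumes the same MRv1 client configuration used in step 2:
    $ hadoop --config /etc/hadoop/conf.cloudera.mapreduce1 job -list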
  4. When the job completes, run some Solr queries to verify that the documents were indexed. For example, for a Solr server running on myserver.example.com, use: http://myserver.example.com:8983/solr/collection3/select?q=*%3A*&wt=json&indent=true
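    You can run the same query from the command line with curl; the wt=json and indent=true parameters return the results as indented JSON:
    $ curl 'http://myserver.example.com:8983/solr/collection3/select?q=*%3A*&wt=json&indent=true'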
    For command line help on how to run a Hadoop MapReduce job, use the following command:
    $ hadoop jar /usr/lib/solr/contrib/mr/search-mr-*-job.jar \
    org.apache.solr.hadoop.MapReduceIndexerTool --help
      Note: For development purposes, use the MapReduceIndexerTool --dry-run option to run in local mode and print documents to stdout, instead of loading them to Solr. Using this option causes the morphline to execute in the client process without submitting a job to MapReduce. Executing in the client process provides quicker turnaround during early trial and debug sessions.
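      For example, you can take the command from step 2, remove the --go-live flag, and add --dry-run. This is only a sketch; all other options are unchanged:
    $ hadoop --config /etc/hadoop/conf.cloudera.mapreduce1 jar \
    /usr/lib/solr/contrib/mr/search-mr-*-job.jar \
    org.apache.solr.hadoop.MapReduceIndexerTool -D \
    'mapred.child.java.opts=-Xmx500m' --log4j \
    /usr/share/doc/search*/examples/solr-nrt/log4j.properties --morphline-file \
    /usr/share/doc/search*/examples/solr-nrt/test-morphlines/tutorialReadAvroContainer.conf \
    --output-dir hdfs://$NNHOST:8020/user/$USER/outdir --verbose --dry-run \
    --zk-host $ZKHOST:2181/solr --collection collection3 \
    hdfs://$NNHOST:8020/user/$USER/indir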
      Note: To print diagnostic information, such as the content of records as they pass through the morphline commands, consider enabling TRACE log level. You can enable TRACE log level diagnostics by adding the following entry to your log4j.properties file:
    log4j.logger.org.kitesdk.morphline=TRACE
    The log4j.properties file can be passed via the MapReduceIndexerTool --log4j command line option.