
Batch Indexing into Offline Solr Shards

You can run the MapReduce job again, but this time without the GoLive feature. This causes the job to create a set of Solr index shards from a set of input files and write the indexes to HDFS. You can then explicitly point each Solr server to one of the HDFS output shard directories.

  1. Delete all existing documents in Solr, and remove any output left over from a previous run.
    $ solrctl collection --deletedocs collection3
    $ sudo -u hdfs hadoop fs -rm -r -skipTrash /user/$USER/outdir
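    To confirm that the collection is now empty, you can query it and check that numFound is 0. This is a minimal sketch that assumes a Solr server listening on localhost at the default port 8983; substitute a host from your own cluster.
    $ curl 'http://localhost:8983/solr/collection3/select?q=*%3A*&rows=0&wt=json'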
  2. Run the Hadoop MapReduce job. Replace $NNHOST in the command with your NameNode hostname and port number; one way to look up this value is sketched after the command.
    $ hadoop --config /etc/hadoop/conf.cloudera.mapreduce1 jar \
    /usr/lib/solr/contrib/mr/search-mr-*-job.jar \
    org.apache.solr.hadoop.MapReduceIndexerTool \
    -D 'mapred.child.java.opts=-Xmx500m' \
    --log4j /usr/share/doc/search*/examples/solr-nrt/log4j.properties \
    --morphline-file /usr/share/doc/search*/examples/solr-nrt/test-morphlines/tutorialReadAvroContainer.conf \
    --output-dir hdfs://$NNHOST:8020/user/$USER/outdir \
    --verbose \
    --solr-home-dir $HOME/collection3 \
    --shards 2 \
    hdfs://$NNHOST:8020/user/$USER/indir
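    If you are unsure of the value to use for $NNHOST, one way to look up the NameNode URI (a sketch that assumes the HDFS client configuration is present on the machine where you run the job) is:
    $ hdfs getconf -confKey fs.defaultFS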
  3. Check the JobTracker status. For example, if the JobTracker runs on the local host, use http://localhost:50030/jobtracker.jsp.
  4. When the job completes, check the generated index files. Individual shards are written to the results directory with names of the form part-00000, part-00001, and so on. Because this example specifies --shards 2, only part-00000 and part-00001 are created.
    $ hadoop fs -ls /user/$USER/outdir/results
    $ hadoop fs -ls /user/$USER/outdir/results/part-00000/data/index
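    Optionally, to get a rough sense of the relative size of the two shards, you can also check their footprint in HDFS:
    $ hadoop fs -du -s -h /user/$USER/outdir/results/part-00000
    $ hadoop fs -du -s -h /user/$USER/outdir/results/part-00001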
  5. Stop Solr on each node of the cluster.
    $ sudo service solr-server stop
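    If you manage several Solr hosts from a single machine, one possible sketch for stopping the service on all of them, assuming passwordless SSH and that $HOSTNAME1 and $HOSTNAME2 hold your Solr hostnames, is:
    $ for host in $HOSTNAME1 $HOSTNAME2; do ssh -t $host sudo service solr-server stop; done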
  6. List the hostname directories that form part of the path to each index in the SolrCloud cluster.
    $ hadoop fs -ls /solr/collection3
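    The directory names returned by this listing are the hostnames referred to as $HOSTNAME1 and $HOSTNAME2 in the next step. As a convenience, you can record them in shell variables; the hostnames below are placeholders, so substitute the names from your own listing:
    $ export HOSTNAME1=server1.example.com
    $ export HOSTNAME2=server2.example.com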
  7. Move index shards into place.
    1. Remove outdated files:
      $ sudo -u solr hadoop fs -rm -r -skipTrash \
      /solr/collection3/$HOSTNAME1/data/index
      $ sudo -u solr hadoop fs -rm -r -skipTrash \
      /solr/collection3/$HOSTNAME2/data/index
    2. Ensure correct ownership of required directories:
      $ sudo -u hdfs hadoop fs -chown -R solr /user/$USER/outdir/results
    3. Move the two index shards into place.
        Note: You are moving the index shards to the two servers you set up in Preparing to Index Data.
      $ sudo -u solr hadoop fs -mv /user/$USER/outdir/results/part-00000/data/index \
      /solr/collection3/$HOSTNAME1/data/
      $ sudo -u solr hadoop fs -mv /user/$USER/outdir/results/part-00001/data/index \
      /solr/collection3/$HOSTNAME2/data/
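      To verify that each shard is in the location Solr expects, you can list the moved index directories:
      $ hadoop fs -ls /solr/collection3/$HOSTNAME1/data/index
      $ hadoop fs -ls /solr/collection3/$HOSTNAME2/data/index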
  8. Start Solr on each node of the cluster:
    $ sudo service solr-server start
  9. Run some Solr queries. For example, for a Solr server running on myserver.example.com, use: http://myserver.example.com:8983/solr/collection3/select?q=*%3A*&wt=json&indent=true
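    The same query can also be issued from the shell. This sketch uses curl against the example hostname and requests only the document count (rows=0); substitute the name of one of your Solr hosts:
    $ curl 'http://myserver.example.com:8983/solr/collection3/select?q=*%3A*&rows=0&wt=json&indent=true'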