Using MapReduce Batch Indexing with Cloudera Search
The following sections include examples that illustrate using MapReduce to index tweets. These examples require that you:
- Complete the process of Preparing to Index Data with Cloudera Search.
- Install the MapReduce tools for Cloudera Search as described in Installing MapReduce Tools for use with Cloudera Search.
Batch Indexing into Online Solr Servers Using GoLive
MapReduceIndexerTool is a MapReduce batch job driver that creates a set of Solr index shards from a set of input files and writes the indexes into HDFS in a flexible, scalable, and fault-tolerant manner. Using GoLive, MapReduceIndexerTool also supports merging the output shards into a set of live customer-facing Solr servers, typically a SolrCloud. The following sample steps demonstrate these capabilities.
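Before you start, it can be useful to confirm that the target collection and its configuration are registered in the cluster. One way to do this, assuming solrctl is already configured to reach your ZooKeeper ensemble, is:
# List the instance directories and collections known to the cluster;
# collection1 should appear in both lists.
$ solrctl instancedir --list
$ solrctl collection --list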
- Delete all existing documents in Solr.
$ solrctl collection --deletedocs collection1
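You can verify that the collection is now empty by running a query that returns only the document count. For example, using a placeholder hostname:
# numFound should be 0 after the delete takes effect.
$ curl 'http://myserver.example.com:8983/solr/collection1/select?q=*%3A*&rows=0&wt=json'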
- Run the MapReduce job using GoLive. Replace $NNHOST and $ZKHOST in the command with your NameNode and ZooKeeper hostnames and port numbers, as required (a sketch for setting these variables follows the sample commands). You do not need to specify --solr-home-dir because the job accesses it from ZooKeeper.
- Parcel-based Installation:
$ hadoop --config /etc/hadoop/conf.cloudera.MAPREDUCE-1 jar \
/opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar \
org.apache.solr.hadoop.MapReduceIndexerTool -D \
'mapred.child.java.opts=-Xmx500m' --log4j \
/opt/cloudera/parcels/CDH/share/doc/search*/examples/solr-nrt/log4j.properties --morphline-file \
/opt/cloudera/parcels/CDH/share/doc/search*/examples/solr-nrt/test-morphlines/tutorialReadAvroContainer.conf \
--output-dir hdfs://$NNHOST:8020/user/$USER/outdir --verbose --go-live \
--zk-host $ZKHOST:2181/solr --collection collection1 \
hdfs://$NNHOST:8020/user/$USER/indir
- Package-based Installation:
$ hadoop --config /etc/hadoop/conf.cloudera.MAPREDUCE-1 jar \
/usr/lib/solr/contrib/mr/search-mr-*-job.jar \
org.apache.solr.hadoop.MapReduceIndexerTool -D \
'mapred.child.java.opts=-Xmx500m' --log4j \
/usr/share/doc/search*/examples/solr-nrt/log4j.properties --morphline-file \
/usr/share/doc/search*/examples/solr-nrt/test-morphlines/tutorialReadAvroContainer.conf \
--output-dir hdfs://$NNHOST:8020/user/$USER/outdir --verbose --go-live \
--zk-host $ZKHOST:2181/solr --collection collection1 \
hdfs://$NNHOST:8020/user/$USER/indir
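The commands assume that $NNHOST, $ZKHOST, and $USER are set in your shell and that the Avro input files are already in HDFS. As a sketch with placeholder hostnames, you might set the variables and verify the input directory before launching the job:
$ export NNHOST=nn01.example.com   # placeholder NameNode hostname
$ export ZKHOST=zk01.example.com   # placeholder ZooKeeper hostname
$ hadoop fs -ls hdfs://$NNHOST:8020/user/$USER/indir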
- Check the job tracker status at http://localhost:50030/jobtracker.jsp.
- When the job is complete, run some Solr queries. For example, for myserver.example.com, use: http://myserver.example.com:8983/solr/collection1/select?q=*%3A*&wt=json&indent=true
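The same query can also be issued from the command line, which is convenient for scripted verification. Because the GoLive merge has completed, numFound should now reflect the indexed documents rather than zero:
$ curl 'http://myserver.example.com:8983/solr/collection1/select?q=*%3A*&wt=json&indent=true'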
For help with the MapReduceIndexerTool options, use the following command:
- Parcel-based Installation:
$ hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar \
org.apache.solr.hadoop.MapReduceIndexerTool --help
- Package-based Installation:
$ hadoop jar /usr/lib/solr/contrib/mr/search-mr-*-job.jar \
org.apache.solr.hadoop.MapReduceIndexerTool --help
Batch Indexing into Offline Solr Shards
Running the MapReduce job without GoLive creates a set of Solr index shards from a set of input files and writes the indexes to HDFS. You can then explicitly point each Solr server to one of the HDFS output shard directories.
Batch indexing into offline Solr shards is mainly intended for offline use-cases by experts. Cases requiring read-only indexes for searching can be handled using batch indexing without the --go-live option. By not using GoLive, you can avoid copying datasets between segments, thereby reducing resource demands.
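The offline job reads the collection configuration from a local directory passed with --solr-home-dir rather than from ZooKeeper. If you no longer have the local copy created in Preparing to Index Data with Cloudera Search, one way to fetch it again (assuming the instance directory is named collection1) is:
$ solrctl instancedir --get collection1 $HOME/collection1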
- Delete all existing documents in Solr.
$ solrctl collection --deletedocs collection1
$ sudo -u hdfs hadoop fs -rm -r -skipTrash /user/$USER/outdir
- Run the Hadoop MapReduce job, replacing $NNHOST in the command with your NameNode hostname and port number, as required.
- Parcel-based Installation:
$ hadoop --config /etc/hadoop/conf.cloudera.MAPREDUCE-1 jar \
/opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar \
org.apache.solr.hadoop.MapReduceIndexerTool -D \
'mapred.child.java.opts=-Xmx500m' --log4j \
/opt/cloudera/parcels/CDH/share/doc/search*/examples/solr-nrt/log4j.properties --morphline-file \
/opt/cloudera/parcels/CDH/share/doc/search*/examples/solr-nrt/test-morphlines/tutorialReadAvroContainer.conf \
--output-dir hdfs://$NNHOST:8020/user/$USER/outdir --verbose --solr-home-dir \
$HOME/collection1 --shards 2 hdfs://$NNHOST:8020/user/$USER/indir
- Package-based Installation:
$ hadoop --config /etc/hadoop/conf.cloudera.MAPREDUCE-1 jar \
/usr/lib/solr/contrib/mr/search-mr-*-job.jar \
org.apache.solr.hadoop.MapReduceIndexerTool -D \
'mapred.child.java.opts=-Xmx500m' --log4j \
/usr/share/doc/search*/examples/solr-nrt/log4j.properties --morphline-file \
/usr/share/doc/search*/examples/solr-nrt/test-morphlines/tutorialReadAvroContainer.conf \
--output-dir hdfs://$NNHOST:8020/user/$USER/outdir --verbose --solr-home-dir \
$HOME/collection1 --shards 2 hdfs://$NNHOST:8020/user/$USER/indir
- Check the job tracker status. For example, for a job tracker running on the local host, use http://localhost:50030/jobtracker.jsp.
- After the job completes, check the generated index files. Individual shards are written to the results directory with names of the form part-00000, part-00001, and so on; in this example there are only two shards, part-00000 and part-00001.
$ hadoop fs -ls /user/$USER/outdir/results
$ hadoop fs -ls /user/$USER/outdir/results/part-00000/data/index
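To gauge how large each generated shard is before moving it into place, you can also check the space used by each index directory. For example:
$ hadoop fs -du /user/$USER/outdir/results/part-00000/data/index
$ hadoop fs -du /user/$USER/outdir/results/part-00001/data/index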
- Stop Solr on each host of the cluster.
$ sudo service solr-server stop
- List the host name folders used as part of the path to each index in the SolrCloud cluster.
$ hadoop fs -ls /solr/collection1
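The commands that follow refer to these folders through $HOSTNAME1 and $HOSTNAME2. As a sketch with placeholder hostnames, you might capture the two folder names from the listing above:
$ export HOSTNAME1=host1.example.com   # first Solr server folder from the listing
$ export HOSTNAME2=host2.example.com   # second Solr server folder from the listing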
- Move index shards into place.
- Remove outdated files:
$ sudo -u solr hadoop fs -rm -r -skipTrash \
/solr/collection1/$HOSTNAME1/data/index
$ sudo -u solr hadoop fs -rm -r -skipTrash \
/solr/collection1/$HOSTNAME2/data/index
- Ensure correct ownership of required directories:
$ sudo -u hdfs hadoop fs -chown -R solr /user/$USER/outdir/results
- Move the two index shards into place (the two servers you set up in Preparing to Index Data with Cloudera Search):
$ sudo -u solr hadoop fs -mv /user/$USER/outdir/results/part-00000/data/index \
/solr/collection1/$HOSTNAME1/data/
$ sudo -u solr hadoop fs -mv /user/$USER/outdir/results/part-00001/data/index \
/solr/collection1/$HOSTNAME2/data/
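You can confirm that each shard landed in the expected location before restarting Solr, for example:
$ hadoop fs -ls /solr/collection1/$HOSTNAME1/data/index
$ hadoop fs -ls /solr/collection1/$HOSTNAME2/data/index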
- Start Solr on each host of the cluster:
$ sudo service solr-server start
- Run some Solr queries. For example, for myserver.example.com, use: http://myserver.example.com:8983/solr/collection1/select?q=*%3A*&wt=json&indent=true