Indexing data with MapReduceIndexerTool in Solr backup format

MapReduceIndexerTool (MRIT) can batch index a dataset using morphlines and write the output in Solr backup format. The resulting backup can then be ingested into Solr using a restore operation.

The MRIT backup format feature addresses the trade-off between the two existing ways of ingesting indexes produced by MRIT jobs into Solr:

  • Near-real-time (NRT) ingestion using the --go-live option is resource-intensive and involves merging indexes.
  • Batch indexing requires shutting down the Solr server.
The MRIT backup format combines the best of both approaches: because the index is created in Solr backup format, it can be ingested into Solr as a restore operation, using the solrctl command line utility. This method is significantly less resource-intensive for Solr than NRT ingestion with --go-live. Restoring the backup creates a new collection, which can be queried directly or placed behind an alias.
  1. To perform a batch indexing job on MRIT with the output in Solr backup format, run the following command:
    hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar org.apache.solr.hadoop.MapReduceIndexerTool --morphline-file [***MORPHLINE_FILE***] --output-dir "[***ABSOLUTE/PATH/TO/OUTPUT/DIRECTORY***]" --use-backup-format --backup-name [***USER_SPECIFIED_NAME_FOR_THE_BACKUP***] --zk-host [***HOSTNAME***]:2181/solr --collection [***COLLECTION_NAME***] "[***ABSOLUTE/PATH/TO/INPUT/FILE***]"
    Replace [***MORPHLINE_FILE***], [***ABSOLUTE/PATH/TO/OUTPUT/DIRECTORY***], [***USER_SPECIFIED_NAME_FOR_THE_BACKUP***], [***HOSTNAME***], [***COLLECTION_NAME***], and [***ABSOLUTE/PATH/TO/INPUT/FILE***] with values applicable in your environment.
    For example:

    To parse the contents of hdfs://ns1:8020/tmp/inputfile using the morphline file morphlines.conf and write the resulting index to hdfs://ns1:8020/tmp/output/results/backupName:

    hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar org.apache.solr.hadoop.MapReduceIndexerTool --morphline-file morphlines.conf --output-dir "hdfs://ns1:8020/tmp/output" --use-backup-format --backup-name backupName --zk-host zk-server:2181/solr --collection collection "hdfs://ns1:8020/tmp/inputfile"
  2. To create a new collection with the contents of the backup, run the following command:
    solrctl collection --restore [***USER_DEFINED_COLLECTION_NAME***] -b [***NAME_OF_THE_INDEX_IN_BACKUP_FORMAT***] -l [***ABSOLUTE/PATH/TO/RESTORE/TARGET/DIRECTORY***] -i [***REQUEST_ID***]

    Make sure that you use a unique [***REQUEST_ID***] each time you run this command.

    Replace [***USER_DEFINED_COLLECTION_NAME***], [***NAME_OF_THE_INDEX_IN_BACKUP_FORMAT***], [***ABSOLUTE/PATH/TO/RESTORE/TARGET/DIRECTORY***], and [***REQUEST_ID***] with values applicable in your environment.

    For example:

    To create the collection finalcollectionName from the backup backupName stored in the directory hdfs://ns1:8020/tmp/output/results, using the request ID 1234:

    solrctl collection --restore finalcollectionName -b backupName -l hdfs://ns1:8020/tmp/output/results -i 1234
  3. To monitor the status of the restore step, run the following command:
    solrctl collection --request-status [***REQUEST_ID***]

    Replace [***REQUEST_ID***] with the ID of the task you want to monitor.

    For example:
    solrctl collection --request-status 1234
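
The three steps above could be chained in a small wrapper script. The following is a minimal sketch, not a supported utility: the function names and the DRY_RUN, POLL_SECONDS, and MAX_POLLS variables are hypothetical, and the polling logic assumes that the status output contains the word "completed" or "failed", which you should verify in your environment. With DRY_RUN left at its default of 1, the commands are only printed.

```shell
#!/bin/sh
# Sketch only. Function names, DRY_RUN, POLL_SECONDS and MAX_POLLS are
# hypothetical helpers introduced for illustration.

# Print the command when DRY_RUN=1 (the default here); run it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Step 1: batch index the input and write the result in Solr backup format.
# Arguments: morphline file, output directory, backup name,
#            ZooKeeper host string, collection name, input path.
index_to_backup() {
  run hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar \
    org.apache.solr.hadoop.MapReduceIndexerTool \
    --morphline-file "$1" --output-dir "$2" \
    --use-backup-format --backup-name "$3" \
    --zk-host "$4" --collection "$5" "$6"
}

# Step 2: restore the backup into a new collection.
# Arguments: new collection name, backup name, backup location, request ID.
restore_backup() {
  run solrctl collection --restore "$1" -b "$2" -l "$3" -i "$4"
}

# Step 3: poll the status of the restore request. Intended for DRY_RUN=0.
# ASSUMPTION: the status output contains "completed" or "failed".
# Returns 0 on completed, 1 on failed, 2 if MAX_POLLS is exhausted.
wait_for_restore() {
  polls=0
  while [ "$polls" -lt "${MAX_POLLS:-30}" ]; do
    state_output=$(run solrctl collection --request-status "$1") || true
    case "$state_output" in
      *completed*) return 0 ;;
      *failed*) return 1 ;;
    esac
    polls=$((polls + 1))
    sleep "${POLL_SECONDS:-10}"
  done
  return 2
}

# A fresh numeric request ID per run: epoch seconds followed by the shell PID.
new_request_id() {
  printf '%s%s\n' "$(date +%s)" "$$"
}

# Dry run with the example values from this section: prints the commands.
REQUEST_ID=$(new_request_id)
index_to_backup morphlines.conf "hdfs://ns1:8020/tmp/output" backupName \
  zk-server:2181/solr collection "hdfs://ns1:8020/tmp/inputfile"
restore_backup finalcollectionName backupName \
  "hdfs://ns1:8020/tmp/output/results" "$REQUEST_ID"
```

To run the commands for real, set DRY_RUN=0 and call wait_for_restore "$REQUEST_ID" after restore_backup. Generating the request ID in the script is one way to satisfy the requirement that each restore uses a unique ID.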