Indexing data with MapReduceIndexerTool in Solr backup format
MapReduceIndexerTool (MRIT) is capable of batch indexing a dataset and provide the
output in the format of Solr backups, using morphlines. This backup can then be ingested
into Solr using a backup operation.
The MapReduceIndexerTool (MRIT) backup format feature addresses the dilemma of
ingesting indexes produced by MRIT jobs into Solr:
Near-real-time (NRT) ingestion using the --go-live option is
resource-intensive and involves merging indexes.
Batch indexing requires shutting down the Solr server.
MRIT backup format takes the best of both worlds: by creating the index in the Solr
backup format, it can be ingested into Solr as a restore operation, using the
solrctl command line utility. This method is significantly less
resource intensive on the part of Solr compared to NRT with --go-live.
Restoring the backup results in a new collection which can be queried directly or put
behind an alias.
To perform a batch indexing job on MRIT with the output in Solr backup format,
run the following command:
Replace
[***MORPHLINE_FILE***],
[***ABSOLUTE/PATH/TO/OUTPUT/DIRECTORY***],
[***USER_SPECIFIED_NAME_FOR_THE_BACKUP***],
[***HOSTNAME***],
[***COLLECTION_NAME***], and
[***ABSOLUTE/PATH/TO/INPUT/FILE***] with values
applicable in your environment.
For example:
To parse the contents of
hdfs://ns1:8020/tmp/inputfile using the morphline file
morphlines.conf and write the resulting index to
hdfs://ns1:8020/tmp/output/results/backupName:
Make sure that you use a unique <requestID> each
time you run this command.
Replace [***USER_DEFINED_COLLECTION_NAME***],
[***NAME_OF_THE_INDEX_IN_BACKUP_FORMAT***],
[***ABSOLUTE/PATH/TO/RESTORE/TARGET/DIRECTORY***]
with values applicable in your environment.
For example:
To create the collection
finalcollectionName from the backup
backupName to the directory
hdfs://ns1:8020/tmp/output/results with the request ID
1234: