MapReduceIndexerTool
MapReduceIndexerTool is a MapReduce batch job driver that takes a morphline, creates a set of Solr index shards from a set of input files, and writes the indexes into HDFS in a flexible, scalable, and fault-tolerant manner.
For more information on Morphlines, see:
- Extracting, Transforming, and Loading Data With Cloudera Morphlines for an introduction to Morphlines.
- Example Morphline Usage for morphline examples, discussion of those examples, and links to additional information.
The indexer creates an offline index on HDFS in the output directory
specified by the --output-dir parameter. If the --go-live parameter is
specified, Solr merges the
resulting offline index into the live running service. Thus, the Solr
service must have read access to the contents of the output directory to
complete the go-live step. In an environment with restrictive
permissions, such as one with an HDFS umask of 077, the Solr user may not
be able to read the contents of the newly created directory. To address
this issue, the indexer automatically applies HDFS ACLs that enable Solr
to read the output directory contents. These ACLs are applied only if HDFS
ACLs are enabled on the HDFS NameNode. For more information, see HDFS ACLs.
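The effect of a restrictive umask can be worked through with a little arithmetic. The following is a minimal sketch, assuming the usual HDFS behavior of masking a requested mode of 777 for directories and 666 for files, and assuming the Solr user is neither the owner of the output directory nor a member of its group. The function names are illustrative only and are not part of the indexer.

```python
import stat


def effective_mode(requested: int, umask: int) -> int:
    """Return the permission bits left after the umask is applied."""
    return requested & ~umask & 0o777


def readable_by_non_owner(mode: int) -> bool:
    """True if the group or other read bit is set (simplified check)."""
    return bool(mode & (stat.S_IRGRP | stat.S_IROTH))


# Assumed requested modes: 777 for directories, 666 for files.
dir_mode = effective_mode(0o777, 0o077)
file_mode = effective_mode(0o666, 0o077)

print(oct(dir_mode))                     # 0o700
print(oct(file_mode))                    # 0o600
print(readable_by_non_owner(dir_mode))   # False
```

With a umask of 077, only the owner keeps any access, which is why the go-live step depends on the ACLs that the indexer adds.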
The indexer updates ACLs only on the output directory and its contents. If the parent directories of the output directory do not grant the execute permission, the Solr service cannot access the output directory. Solr must have execute permission on each parent directory of the output directory, granted through either standard permissions or ACLs.
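One way to grant Solr access to the parent directories is to add an ACL entry with only the execute ("run") bit on each of them. The sketch below only builds the corresponding hdfs dfs -setfacl command lines and does not run them. It assumes that the service user is named solr, that the root directory is already traversable, and that the example path /user/alice/outdir stands in for a real output directory. The helper names are hypothetical.

```python
import posixpath
import shlex
from typing import List


def parent_dirs(path: str) -> List[str]:
    """List the ancestors of path, topmost first, excluding the root."""
    current = posixpath.dirname(posixpath.normpath(path))
    parents = []
    while current not in ("/", ""):
        parents.append(current)
        current = posixpath.dirname(current)
    return list(reversed(parents))


def traverse_acl_commands(output_dir: str, user: str = "solr") -> List[str]:
    """Build one setfacl command per parent directory (execute bit only)."""
    return [
        f"hdfs dfs -setfacl -m user:{user}:--x {shlex.quote(parent)}"
        for parent in parent_dirs(output_dir)
    ]


# Hypothetical output directory, used for illustration only.
for command in traverse_acl_commands("/user/alice/outdir"):
    print(command)
```

The commands take effect only if HDFS ACLs are enabled on the NameNode. Review the generated list before running anything, because granting traversal on shared parent directories may be broader than intended.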