HBaseMapReduceIndexerTool command line reference

important

If Solr uses a custom system user, you need to pass this username to MapReduceIndexerTool jobs using the following argument:

-Dsolr.authorization.superuser=<<customSolrSystemUser>>

This is necessary for setting up ACLs of the MapReduce jobs directory correctly.

The general command line syntax is:


command [genericOptions] [commandOptions]

usage:

hadoop [GenericOptions]... jar hbase-indexer-mr-*-job.jar
       [--hbase-indexer-zk STRING] [--hbase-indexer-name STRING]
       [--hbase-indexer-file FILE]
       [--hbase-indexer-component-factory STRING]
       [--hbase-table-name STRING] [--hbase-start-row BINARYSTRING]
       [--hbase-end-row BINARYSTRING] [--hbase-start-time STRING]
       [--hbase-end-time STRING] [--hbase-timestamp-format STRING]
       [--zk-host STRING] [--go-live] [--collection STRING]
       [--go-live-min-replication-factor INTEGER]
       [--go-live-threads INTEGER] [--go-live-timeout INTEGER]
       [--filesystem STRING] [--private-key FILE] 
       [--known-hosts FILE] [--local-merge-dir DIR] 
       [--keytab FILE] [--help] [--output-dir HDFS_URI]
       [--overwrite-output-dir] [--morphline-file FILE]
       [--morphline-id STRING] [--solr-home-dir DIR]
       [--update-conflict-resolver FQCN] [--reducers INTEGER]
       [--max-segments INTEGER] [--fair-scheduler-pool STRING] [--dry-run]
       [--log4j FILE] [--verbose] [--clear-index] [--show-non-solr-cloud]

Examples:

(Re)index a table in GoLive mode based on a local indexer config file:

hadoop --config /etc/hadoop/conf \
  jar hbase-indexer-mr-*-job.jar \
  --conf /etc/hbase/conf/hbase-site.xml \
  -D 'mapreduce.job.user.classpath.first=true' \
  -Dmapreduce.map.java.opts="-Xmx512m" \
  -Dmapreduce.reduce.java.opts="-Xmx512m" \
  --hbase-indexer-file indexer.xml \
  --zk-host 127.0.0.1/solr \
  --collection collection1 \
  --go-live \
  --log4j src/test/resources/log4j.properties

(Re)index a table in GoLive mode using a local morphline-based indexer config file. Also include extra library jar file containing JSON tweet Java parser:

hadoop --config /etc/hadoop/conf \
  jar hbase-indexer-mr-*-job.jar \
  --conf /etc/hbase/conf/hbase-site.xml \
  --libjars /path/to/kite-morphlines-twitter-0.10.0.jar \
  -D 'mapreduce.job.user.classpath.first=true' \
  -Dmapreduce.map.java.opts="-Xmx512m" \
  -Dmapreduce.reduce.java.opts="-Xmx512m" \
  --hbase-indexer-file src/test/resources/morphline_indexer_without_zk.xml \
  --zk-host 127.0.0.1/solr \
  --collection collection1 \
  --go-live \
  --morphline-file src/test/resources/morphlines.conf \
  --output-dir hdfs://c2202.mycompany.com/user/$USER/test \
  --overwrite-output-dir \
  --log4j src/test/resources/log4j.properties

(Re)index a table in GoLive mode:

hadoop --config /etc/hadoop/conf \
  jar hbase-indexer-mr-*-job.jar \
  --conf /etc/hbase/conf/hbase-site.xml \
  -D 'mapreduce.job.user.classpath.first=true' \
  -Dmapreduce.map.java.opts="-Xmx512m" \
  -Dmapreduce.reduce.java.opts="-Xmx512m" \
  --hbase-indexer-file indexer.xml \
  --zk-host 127.0.0.1/solr \
  --collection collection1 \
  --go-live \
  --log4j src/test/resources/log4j.properties

(Re)index a table with direct writes to SolrCloud:

hadoop --config /etc/hadoop/conf \
  jar hbase-indexer-mr-*-job.jar \
  --conf /etc/hbase/conf/hbase-site.xml \
  -D 'mapreduce.job.user.classpath.first=true' \
  -Dmapreduce.map.java.opts="-Xmx512m" \
  -Dmapreduce.reduce.java.opts="-Xmx512m" \
  --hbase-indexer-file indexer.xml \
  --zk-host 127.0.0.1/solr \
  --collection collection1 \
  --reducers 0 \
  --log4j src/test/resources/log4j.properties

(Re)index a table based on a indexer config stored in ZK:

hadoop --config /etc/hadoop/conf \
  jar hbase-indexer-mr-*-job.jar \
  --conf /etc/hbase/conf/hbase-site.xml \
  -D 'mapreduce.job.user.classpath.first=true' \
  -Dmapreduce.map.java.opts="-Xmx512m" \
  -Dmapreduce.reduce.java.opts="-Xmx512m" \
  --hbase-indexer-zk zk01 \
  --hbase-indexer-name docindexer \
  --go-live \
  --log4j src/test/resources/log4j.properties

MapReduce on Yarn - Pass custom JVM arguments:

HADOOP_CLIENT_OPTS='-DmaxConnectionsPerHost=10000 -DmaxConnections=10000'; \
hadoop --config /etc/hadoop/conf \
  jar hbase-indexer-mr-*-job.jar \
  --conf /etc/hbase/conf/hbase-site.xml \
  -D 'mapreduce.map.java.opts=-DmaxConnectionsPerHost=10000 -DmaxConnections=10000' \
  -D 'mapreduce.reduce.java.opts=-DmaxConnectionsPerHost=10000 -DmaxConnections=10000' \
  --hbase-indexer-zk zk01 \
  --hbase-indexer-name docindexer \
  --go-live \
  --log4j src/test/resources/log4j.properties

HBase Indexer Parameters

Parameters for specifying the HBase indexer definition and/or where it should be loaded from.

Table 1. HBase Indexer parameters
Parameter	Type	Description	Example
`--hbase-indexer-zk`	STRING	The address of the ZooKeeper ensemble from which to fetch the indexer definition named --hbase-indexer-name. Format is: a list of comma separated host:port pairs, each corresponding to a zk server.	`'127.0.0.1:2181,127.0.0.1:2182,127.0.0.1:2183'`
`--hbase-indexer-name`	STRING	The name of the indexer configuration to fetch from the ZooKeeper ensemble specified with `--hbase-indexer-zk`.	`myIndexer`
`--hbase-indexer-file`	FILE	Relative or absolute path to a local HBase indexer XML configuration file. If supplied, this overrides `--hbase-indexer-zk` and `--hbase-indexer-name`.	`/path/to/morphline-hbase-mapper.xml`
`--hbase-indexer-component-factory`	STRING	Classname of the hbase indexer component factory.

HBase Scan Parameters

Parameters for specifying what data is included while reading from HBase.

Table 2. HBase scan parameters
Parameter	Type	Description	Example
`--hbase-table-name`	STRING	Optional name of the HBase table containing the records to be indexed. If supplied, this overrides the value from the `--hbase-indexer-*` options.	`myTable`
`--hbase-start-row`	BINARYSTRING	Binary string representation of start row from which to start indexing (inclusive). The format of the supplied row key should use two-digit hex values prefixed by \x for non-ascii characters (e.g. `'row\x00'`). The semantics of this argument are the same as those for the HBase `Scan#setStartRow` method. The default is to include the first row of the table.	`AAAA`
`--hbase-end-row`	BINARYSTRING	Binary string representation of end row prefix at which to stop indexing (exclusive). See the description of `--hbase-start-row` for more information. The default is to include the last row of the table.	`CCCC`
`--hbase-start-time`	STRING	Earliest timestamp (inclusive) in time range of HBase cells to be included for indexing. The default is to include all cells.	`0`
`--hbase-end-time`	STRING	Latest timestamp (exclusive) of HBase cells to be included for indexing. The default is to include all cells.	`123456789`
`--hbase-timestamp-format`	STRING	Timestamp format to be used to interpret `--hbase- start-time` and `--hbase-end-time`. This is a `java.text.SimpleDateFormat` compliant format. If this parameter is omitted then the timestamps are interpreted as number of milliseconds since the standard epoch (Unix time).	`yyyy-MM-dd'T'HH:mm:ss.SSSZ`

Solr Cluster Arguments

Arguments that provide information about your Solr cluster.

Table 3. Solr cluster arguments
Argument	Type	Description	Example
`--zk-host`	STRING	The address of a ZooKeeper ensemble being used by a SolrCloud cluster. This ZooKeeper ensemble will be examined to determine the number of output shards to create as well as the Solr URLs to merge the output shards into when using the `--go-live` option. Requires that you also pass the `--collection` to merge the shards into. The format is a list of comma separated host:port pairs, each corresponding to a zk server. The `--zk-host` option implements the same partitioning semantics as the standard SolrCloud Near-Real-Time (NRT) API. This enables to mix batch updates from MapReduce ingestion with updates from standard Solr NRT ingestion on the same SolrCloud cluster, using identical unique document keys. If `--solr-home-dir` is not specified, the Solr home directory for the collection will be downloaded from this ZooKeeper ensemble.	`'127.0.0.1:2181,127.0.0.1:2182,127.0.0.1:2183'` If the optional `chroot` suffix is used the example would look like: `'127.0.0.1:2181/solr127.0.0.1:2182/solr127.0.0.1:2183/solr'` where the client would be rooted at `'/solr'` and all paths would be relative to this root - i.e.: getting/setting/etc... '/foo/bar' would result in operations being run on '/solr/foo/bar' (from the server perspective).
`--solr-client-socket-timeout`	INTEGER	Solr socket timeout in milliseconds This optional argument overwrites the default 10 minute socket timeout in HBase indexer for the direct writing mode (when the value of the `--reducers` optional argument is set to `0` and mappers directly send the data to the live Solr). Default value: 600000

Go Live Arguments

Arguments for merging the shards that are built into a live Solr cluster. Also see the Cluster arguments.

Table 4. Go live arguments
Argument	Type	Description	Example
`--go-live`		Allows you to optionally merge the final index shards into a live Solr cluster after they are built. You can pass the ZooKeeper address with `--zk-host` and the relevant cluster information will be auto detected. (default: `false`)
`--collection`	STRING	The SolrCloud collection to merge shards into when using `--go-live` and `--zk-host`.	`collection1`
`--go-live-min-replication-factor`	INTEGER	The minimum number of SolrCloud replicas to successfully merge any final index shard into. The go-live job phase attempts to merge final index shards into all SolrCloud replicas. Some of these merge operations may fail, for example if some SolrCloud servers are down. This option enables indexing jobs to succeed even if some such merge operations fail on SolrCloud followers. Successful merge operations into all leaders are always required for job success, regardless of the value of `--go-live-min- replication-factor`. -1 indicates require successful merge operations into all replicas. 1 indicates require successful merge operations only into leader replicas. (default: -1)
`--go-live-threads`	INTEGER	Tuning knob that indicates the maximum number of live merges to run in parallel at one time. (default: 1000)
`--go-live-timeout`	INTEGER	Timeout in milliseconds (ms) to wait for the merge to complete before the connection times out and the tool fails.
`--filesystem`	STRING	Allows you to change whether you want to merge indexes into a collection on HDFS or on localfs. Possible values are: `hdfs` (default) `localfs` In case of `localfs` make sure that the target collection uses localfs. This option also requires the `--use-zk-solrconfig.xml` and `--private-key` arguments.
`--private-key`	FILE	Path to the private key that allows the user running the MRIT job to SSH into all Solr hosting nodes. This must be provided together with the `--filesystem localfs` argument.
`--known-hosts`	FILE	Path to the known hosts file which contains keys to all the Solr hosting nodes. Use with the `--filesystem localfs` argument. important Always specify this in `localfs` mode in production use. If not provided, strict host checking is skipped. This is unsafe and Cloudera only recommends it for testing purposes.
`--local-merge-dir`	DIR	Path to a directory on all Solr hosts to temporarily copy index to before mergeindexes action. Used only with the `--filesystem localfs` option. The user running the MRIT job needs to have permission to read and write into this directory. This directory needs to be relative to SOLR_HOME or SOLR_DATA_HOME or needs to be specified in the Solr system property `-Dsolr.allowPaths`. (default: `/var/lib/solr`)
`--keytab`	FILE	Path to the keytab file for the user running the MRIT job on all Solr hosting nodes. Used only with the `--filesystem localfs` argument. If not provided, kinit is skipped and MRIT expects an unsecure environment or a valid kerberos ticket already present on all Solr hosting nodes for the current user.

Table 5. Optional arguments
Argument	Type	Description	Example
`--help` `-help` `-h`		Show the help message and exit
`--output-dir`	HDFS_URI	HDFS directory to write Solr indexes to. Inside there one output directory per shard will be generated.	`hdfs://c2202.mycompany. com/user/$USER/test`
`--overwrite-output-dir`		Overwrite the directory specified by `--output-dir` if it already exists. Using this parameter will result in the output directory being recursively deleted at job startup. (default: false)
`--morphline-file`	FILE	Relative or absolute path to a local config file that contains one or more morphlines. The file must be UTF-8 encoded. The file will be uploaded to each MR task. If supplied, this overrides the value from the `--hbase-indexer-*` options.	`/path/to/morphlines.conf`
`--morphline-id`	STRING	The identifier of the morphline that shall be executed within the morphline config file, e.g. specified by `--morphline-file`. If the `--morphline- id` option is ommitted the first (i.e. top-most) morphline within the config file is used. If supplied, this overrides the value from the `--hbase-indexer-*` options.	`morphline1`
`--solr-home-dir`	DIR	Optional relative or absolute path to a local dir containing Solr `conf/` dir and in particular `conf/solrconfig.xml` and optionally also `lib/` dir. This directory will be uploaded to each MR task.	`src/test/resources/solr/minimr`
`--update-conflict-resolver`	FQCN	Fully qualified class name of a Java class that implements the `UpdateConflictResolver` interface. This enables deduplication and ordering of a series of document updates for the same unique document key. For example, a MapReduce batch job might index multiple files in the same job where some of the files contain old and new versions of the very same document, using the same unique document key. Typically, implementations of this interface forbid collisions by throwing an exception, or ignore all but the most recent document version, or, in the general case, order colliding updates ascending from least recent to most recent (partial) update. The caller of this interface (i. e. the Hadoop Reducer) will then apply the updates to Solr in the order returned by the `orderUpdates()` method. The default `RetainMostRecentUpdateConflictResolver` implementation ignores all but the most recent document version, based on a configurable numeric Solr field, which defaults to the `file_last_modified` timestamp. (default: `org.apache. solr.hadoop.dedup. RetainMostRecentUpdateConflictResolver`)
`--reducers`	INTEGER	Tuning knob that indicates the number of reducers to index into. 0 indicates that no reducers should be used, and documents should be sent directly from the mapper tasks to live Solr servers. -1 indicates use all reduce slots available on the cluster. -2 indicates use one reducer per output shard, which disables the mtree merge MR algorithm. The mtree merge MR algorithm improves scalability by spreading load (in particular CPU load) among a number of parallel reducers that can be much larger than the number of solr shards expected by the user. It can be seen as an extension of concurrent lucene merges and tiered lucene merges to the clustered case. The subsequent mapper-only phase merges the output of said large number of reducers to the number of shards expected by the user, again by utilizing more available parallelism on the cluster. (default: -1)
`--max-segments`	INTEGER	Tuning knob that indicates the maximum number of segments to be contained on output in the index of each reducer shard. After a reducer has built its output index it applies a merge policy to merge segments until there are <= `maxSegments` lucene segments left in this index. Merging segments involves reading and rewriting all data in all these segment files, potentially multiple times, which is very I/O intensive and time consuming. However, an index with fewer segments can later be merged faster, and it can later be queried faster once deployed to a live Solr serving shard. Set `maxSegments` to `1` to optimize the index for low query latency. In a nutshell, a small `maxSegments` value trades indexing latency for subsequently improved query latency. This can be a reasonable trade-off for batch indexing systems. (default: 1)
`--dry-run`		Run in local mode and print documents to stdout instead of loading them into Solr. This executes the morphline in the client process (without submitting a job to MR) for quicker turnaround during early trial and debug sessions. (default: false)
`--log4j`	FILE	Relative or absolute path to a `log4j.properties` config file on the local file system. This file will be uploaded to each MR task.	`/path/to/log4j.properties`
`--verbose` `-v`		Turn on verbose output. (default: false)
`--clear-index`		Will attempt to delete all entries in a solr index before starting batch build. This is not transactional so if the build fails the index will be empty. (default: false)
`--show-non-solr-cloud`		Also show options for Non-SolrCloud mode as part of `--help`. (default: false)

Supported Generic Options

The following generic options are supported:

Table 6. Supported generic options
Option	Description
`--conf <configuration file>`	Specify an application configuration file.
`-D <property=value>`	Define a value for a given property.
`-fs <file:///\|hdfs://namenode:port>`	Specify default filesystem URL to use, overrides the `fs.defaultFS` property from configurations.
`--jt <local\|resourcemanager:port>`	Specify a ResourceManager.
`--files <file1,...>`	Specify a comma-separated list of files to be copied to the map reduce cluster.
`--libjars <jar1,...>`	Specify a comma-separated list of jar files to be included in the classpath.
`--archives <archive1,...>`	Specify a comma-separated list of archives to be unarchived on the compute machines.