HBaseMapReduceIndexerTool command line reference
Command line syntax, examples and list of parameters.
command [genericOptions] [commandOptions]
usage: hadoop [GenericOptions]... jar hbase-indexer-mr-*-job.jar
[--hbase-indexer-zk STRING] [--hbase-indexer-name STRING]
[--hbase-indexer-file FILE]
[--hbase-indexer-component-factory STRING]
[--hbase-table-name STRING] [--hbase-start-row BINARYSTRING]
[--hbase-end-row BINARYSTRING] [--hbase-start-time STRING]
[--hbase-end-time STRING] [--hbase-timestamp-format STRING]
[--zk-host STRING] [--go-live] [--collection STRING]
[--go-live-min-replication-factor INTEGER]
[--go-live-threads INTEGER] [--go-live-timeout INTEGER]
[--filesystem STRING] [--private-key FILE]
[--known-hosts FILE] [--local-merge-dir DIR]
[--keytab FILE] [--help] [--output-dir HDFS_URI]
[--overwrite-output-dir] [--morphline-file FILE]
[--morphline-id STRING] [--solr-home-dir DIR]
[--update-conflict-resolver FQCN] [--reducers INTEGER]
[--max-segments INTEGER] [--fair-scheduler-pool STRING] [--dry-run]
[--log4j FILE] [--verbose] [--clear-index] [--show-non-solr-cloud]
Examples:
# Index a table in go-live mode using a local indexer configuration file
hadoop --config /etc/hadoop/conf \
jar hbase-indexer-mr-*-job.jar \
--conf /etc/hbase/conf/hbase-site.xml \
-D 'mapreduce.job.user.classpath.first=true' \
-Dmapreduce.map.java.opts="-Xmx512m" \
-Dmapreduce.reduce.java.opts="-Xmx512m" \
--hbase-indexer-file indexer.xml \
--zk-host 127.0.0.1/solr \
--collection collection1 \
--go-live \
--log4j src/test/resources/log4j.properties
# Index a table in go-live mode using a local morphline-based indexer configuration file
hadoop --config /etc/hadoop/conf \
jar hbase-indexer-mr-*-job.jar \
--conf /etc/hbase/conf/hbase-site.xml \
--libjars /path/to/kite-morphlines-twitter-0.10.0.jar \
-D 'mapreduce.job.user.classpath.first=true' \
-Dmapreduce.map.java.opts="-Xmx512m" \
-Dmapreduce.reduce.java.opts="-Xmx512m" \
--hbase-indexer-file src/test/resources/morphline_indexer_without_zk.xml \
--zk-host 127.0.0.1/solr \
--collection collection1 \
--go-live \
--morphline-file src/test/resources/morphlines.conf \
--output-dir hdfs://c2202.mycompany.com/user/$USER/test \
--overwrite-output-dir \
--log4j src/test/resources/log4j.properties
# Index a table with direct writes to SolrCloud (--reducers 0, no go-live merge)
hadoop --config /etc/hadoop/conf \
jar hbase-indexer-mr-*-job.jar \
--conf /etc/hbase/conf/hbase-site.xml \
-D 'mapreduce.job.user.classpath.first=true' \
-Dmapreduce.map.java.opts="-Xmx512m" \
-Dmapreduce.reduce.java.opts="-Xmx512m" \
--hbase-indexer-file indexer.xml \
--zk-host 127.0.0.1/solr \
--collection collection1 \
--reducers 0 \
--log4j src/test/resources/log4j.properties
# Index a table using an indexer configuration stored in ZooKeeper
hadoop --config /etc/hadoop/conf \
jar hbase-indexer-mr-*-job.jar \
--conf /etc/hbase/conf/hbase-site.xml \
-D 'mapreduce.job.user.classpath.first=true' \
-Dmapreduce.map.java.opts="-Xmx512m" \
-Dmapreduce.reduce.java.opts="-Xmx512m" \
--hbase-indexer-zk zk01 \
--hbase-indexer-name docindexer \
--go-live \
--log4j src/test/resources/log4j.properties
# Pass custom JVM system properties to the client, map, and reduce JVMs
HADOOP_CLIENT_OPTS='-DmaxConnectionsPerHost=10000 -DmaxConnections=10000' \
hadoop --config /etc/hadoop/conf \
jar hbase-indexer-mr-*-job.jar \
--conf /etc/hbase/conf/hbase-site.xml \
-D 'mapreduce.map.java.opts=-DmaxConnectionsPerHost=10000 -DmaxConnections=10000' \
-D 'mapreduce.reduce.java.opts=-DmaxConnectionsPerHost=10000 -DmaxConnections=10000' \
--hbase-indexer-zk zk01 \
--hbase-indexer-name docindexer \
--go-live \
--log4j src/test/resources/log4j.properties
HBase Indexer Parameters
Parameters for specifying the HBase indexer definition and/or where it should be loaded from.
Parameter | Type | Description | Example |
---|---|---|---|
--hbase-indexer-zk | STRING | The address of the ZooKeeper ensemble from which to fetch the indexer definition named by --hbase-indexer-name. The format is a comma-separated list of host:port pairs, each corresponding to a ZooKeeper server. | '127.0.0.1:2181,127.0.0.1:2182,127.0.0.1:2183' |
--hbase-indexer-name | STRING | The name of the indexer configuration to fetch from the ZooKeeper ensemble specified with --hbase-indexer-zk. | myIndexer |
--hbase-indexer-file | FILE | Relative or absolute path to a local HBase indexer XML configuration file. If supplied, this overrides --hbase-indexer-zk and --hbase-indexer-name. | /path/to/morphline-hbase-mapper.xml |
--hbase-indexer-component-factory | STRING | Class name of the HBase indexer component factory. | |
HBase Scan Parameters
Parameters for specifying what data is included while reading from HBase.
Parameter | Type | Description | Example |
---|---|---|---|
--hbase-table-name | STRING | Optional name of the HBase table containing the records to be indexed. If supplied, this overrides the value from the --hbase-indexer-* options. | myTable |
--hbase-start-row | BINARYSTRING | Binary string representation of the start row from which to start indexing (inclusive). The supplied row key should use two-digit hex values prefixed by \x for non-ASCII characters (for example, 'row\x00'). The semantics of this argument are the same as those of the HBase Scan#setStartRow method. The default is to include the first row of the table. | AAAA |
--hbase-end-row | BINARYSTRING | Binary string representation of the end row prefix at which to stop indexing (exclusive). See the description of --hbase-start-row for more information. The default is to include the last row of the table. | CCCC |
--hbase-start-time | STRING | Earliest timestamp (inclusive) in the time range of HBase cells to be included for indexing. The default is to include all cells. | 0 |
--hbase-end-time | STRING | Latest timestamp (exclusive) of HBase cells to be included for indexing. The default is to include all cells. | 123456789 |
--hbase-timestamp-format | STRING | Timestamp format to be used to interpret --hbase-start-time and --hbase-end-time. | yyyy-MM-dd'T'HH:mm:ss.SSSZ |
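The binary string notation used by --hbase-start-row and --hbase-end-row can be produced with a small shell helper. This is a minimal sketch: the function name to_binary_string is hypothetical, and it assumes that the tool decodes a \x escape for any byte, so it conservatively escapes everything except ASCII letters and digits rather than only non-ASCII bytes.

```shell
# Hypothetical helper: read raw bytes on stdin and print them in the
# \xHH binary string notation. ASCII letters and digits are kept as-is;
# every other byte is written as \x followed by two uppercase hex digits.
to_binary_string() {
  out=""
  for hex in $(od -An -v -tx1); do
    case "$hex" in
      3[0-9]|4[1-9a-f]|5[0-9a]|6[1-9a-f]|7[0-9a])
        # Digit or letter: convert the hex code back to the character.
        out="$out$(printf "\\$(printf '%03o' "0x$hex")")" ;;
      *)
        # Anything else: emit an escaped two-digit hex value.
        out="$out\\x$(printf '%s' "$hex" | tr 'a-f' 'A-F')" ;;
    esac
  done
  printf '%s\n' "$out"
}

printf 'row\0' | to_binary_string   # prints row\x00
```

The printed value can then be passed in single quotes, for example as --hbase-start-row 'row\x00'.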
Solr Cluster Arguments
Arguments that provide information about your Solr cluster.
Argument | Type | Description | Example |
---|---|---|---|
--zk-host | STRING | The address of a ZooKeeper ensemble being used by a SolrCloud cluster. This ZooKeeper ensemble is examined to determine the number of output shards to create as well as the Solr URLs to merge the output shards into when using the --go-live option. The format is a comma-separated list of host:port pairs, each corresponding to a ZooKeeper server. | '127.0.0.1:2181,127.0.0.1:2182,127.0.0.1:2183' If the optional chroot suffix is used (as in 127.0.0.1/solr), the client is rooted at that path. |
--solr-client-socket-timeout | INTEGER | Solr socket timeout in milliseconds. This optional argument overwrites the default 10 minute socket timeout in HBase indexer for the direct writing mode (when the value of --reducers is 0). Default value: 600000 | |
Go Live Arguments
Arguments for merging the shards that are built into a live Solr cluster. Also see the Cluster arguments.
Argument | Type | Description | Example |
---|---|---|---|
--go-live | | Allows you to optionally merge the final index shards into a live Solr cluster after they are built. You can pass the ZooKeeper address with --zk-host and the relevant cluster information will be auto detected. (default: false) | |
--collection | STRING | The SolrCloud collection to merge shards into when using --go-live and --zk-host. | collection1 |
--go-live-min-replication-factor | INTEGER | The minimum number of SolrCloud replicas to successfully merge any final index shard into. The go-live job phase attempts to merge final index shards into all SolrCloud replicas. Some of these merge operations may fail, for example if some SolrCloud servers are down. This option enables indexing jobs to succeed even if some such merge operations fail on SolrCloud followers. Successful merge operations into all leaders are always required for job success, regardless of the value of --go-live-min-replication-factor. -1 requires successful merge operations into all replicas. 1 requires successful merge operations only into leader replicas. (default: -1) | |
--go-live-threads | INTEGER | Tuning knob that indicates the maximum number of live merges to run in parallel at one time. (default: 1000) | |
--go-live-timeout | INTEGER | Timeout in milliseconds (ms) to wait for the merge to complete before the connection times out and the tool fails. | |
--filesystem | STRING | Allows you to change whether you want to merge indexes into a collection on HDFS or on localfs. If you specify localfs, make sure that the target collection uses localfs; this value also requires the --use-zk-solrconfig.xml and --private-key arguments. | |
--private-key | FILE | Path to the private key that allows the user running the MRIT job to SSH into all Solr hosting nodes. This must be provided together with the --filesystem localfs argument. | |
--known-hosts | FILE | Path to the known hosts file, which contains keys to all the Solr hosting nodes. Use with the --filesystem localfs argument. | |
--local-merge-dir | DIR | Path to a directory on all Solr hosts to temporarily copy the index to before the mergeindexes action. Used only with the --filesystem localfs option. The user running the MRIT job needs permission to read and write into this directory. This directory needs to be relative to SOLR_HOME or SOLR_DATA_HOME, or needs to be specified in the Solr system property -Dsolr.allowPaths. (default: | |
--keytab | FILE | Path to the keytab file for the user running the MRIT job on all Solr hosting nodes. Used only with the --filesystem localfs argument. If not provided, kinit is skipped and MRIT expects an unsecure environment or a valid Kerberos ticket already present on all Solr hosting nodes for the current user. | |
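The success rule described for --go-live-min-replication-factor can be sketched as a small shell function. This only illustrates the rule as stated in the table for a single shard; it is not the tool's implementation, and the function name merge_succeeded and its arguments are hypothetical.

```shell
# Hypothetical sketch of the go-live success rule for one shard.
# Arguments: min replication factor, total replicas,
#            replicas merged successfully, leader merged (1 or 0).
merge_succeeded() {
  min_rf=$1; total=$2; merged=$3; leader_ok=$4
  # A failed merge into the leader always fails the job.
  [ "$leader_ok" -eq 1 ] || return 1
  # -1 means every replica must have merged successfully.
  [ "$min_rf" -eq -1 ] && min_rf=$total
  [ "$merged" -ge "$min_rf" ]
}

# One follower out of three replicas is down:
merge_succeeded -1 3 2 1 && echo ok || echo failed   # prints failed
merge_succeeded 1 3 2 1 && echo ok || echo failed    # prints ok
```

With the default of -1, a single unavailable follower fails the job; setting the value to 1 lets the job succeed as long as the leader merge succeeds.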
Argument | Type | Description | Example |
---|---|---|---|
--help | | Show the help message and exit. | |
--output-dir | HDFS_URI | HDFS directory to write Solr indexes to. One output directory per shard is generated inside it. | hdfs://c2202.mycompany.com/user/$USER/test |
--overwrite-output-dir | | Overwrite the directory specified by --output-dir if it already exists. Using this parameter results in the output directory being recursively deleted at job startup. (default: false) | |
--morphline-file | FILE | Relative or absolute path to a local config file that contains one or more morphlines. The file must be UTF-8 encoded. The file will be uploaded to each MR task. If supplied, this overrides the value from the --hbase-indexer-* options. | /path/to/morphlines.conf |
--morphline-id | STRING | The identifier of the morphline to execute within the morphline config file, for example the file specified by --morphline-file. If the --morphline-id option is omitted, the first (top-most) morphline within the config file is used. If supplied, this overrides the value from the --hbase-indexer-* options. | morphline1 |
--solr-home-dir | DIR | Optional relative or absolute path to a local directory containing the Solr conf/ directory, in particular conf/solrconfig.xml, and optionally also a lib/ directory. This directory will be uploaded to each MR task. | src/test/resources/solr/minimr |
--update-conflict-resolver | FQCN | Fully qualified class name of a Java class that implements the UpdateConflictResolver interface. This enables deduplication and ordering of a series of document updates for the same unique document key. For example, a MapReduce batch job might index multiple files in the same job where some of the files contain old and new versions of the very same document, using the same unique document key. Typically, implementations of this interface forbid collisions by throwing an exception, or ignore all but the most recent document version, or, in the general case, order colliding updates ascending from least recent to most recent (partial) update. The caller of this interface (i.e. the Hadoop Reducer) will then apply the updates to Solr in the order returned by the The default (default: | |
--reducers | INTEGER | Tuning knob that indicates the number of reducers to index into. (default: -1) | |
--max-segments | INTEGER | Tuning knob that indicates the maximum number of segments to be contained on output in the index of each reducer shard. After a reducer has built its output index it applies a merge policy to merge segments until there are <= Set In a nutshell, a small (default: 1) | |
--dry-run | | Run in local mode and print documents to stdout instead of loading them into Solr. This executes the morphline in the client process (without submitting a job to MR) for quicker turnaround during early trial and debug sessions. (default: false) | |
--log4j | FILE | Relative or absolute path to a log4j.properties config file on the local file system. This file will be uploaded to each MR task. | /path/to/log4j.properties |
--verbose | | Turn on verbose output. (default: false) | |
--clear-index | | Attempts to delete all entries in a Solr index before starting the batch build. This is not transactional, so if the build fails the index will be empty. (default: false) | |
--show-non-solr-cloud | | Also show options for Non-SolrCloud mode as part of --help. (default: false) | |
Supported Generic Options
The following generic options are supported:
Option | Description |
---|---|
--conf <configuration file> | Specify an application configuration file. |
-D <property=value> | Define a value for a given property. |
-fs <file:///\|hdfs://namenode:port> | Specify the default filesystem URL to use; overrides the fs.defaultFS property from configurations. |
--jt <local\|resourcemanager:port> | Specify a ResourceManager. |
--files <file1,...> | Specify a comma-separated list of files to be copied to the MapReduce cluster. |
--libjars <jar1,...> | Specify a comma-separated list of jar files to be included in the classpath. |
--archives <archive1,...> | Specify a comma-separated list of archives to be unarchived on the compute machines. |
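The syntax line at the top of this page places generic options before the tool's own options. A minimal sketch of a wrapper that assembles a command in that order and prints it for review instead of running it is shown below. The function name print_indexer_cmd and its arguments are hypothetical; the paths and flags are copied from the examples on this page.

```shell
# Hypothetical helper: print (not run) an indexer command with the
# generic options (--conf, -D) placed before the tool options.
print_indexer_cmd() {
  indexer_file=$1; zk_host=$2; collection=$3
  printf '%s\n' \
    "hadoop --config /etc/hadoop/conf jar hbase-indexer-mr-*-job.jar \\" \
    "  --conf /etc/hbase/conf/hbase-site.xml \\" \
    "  -D 'mapreduce.job.user.classpath.first=true' \\" \
    "  --hbase-indexer-file $indexer_file \\" \
    "  --zk-host $zk_host \\" \
    "  --collection $collection \\" \
    "  --go-live"
}

print_indexer_cmd indexer.xml 127.0.0.1/solr collection1
```

Printing the command first makes it easy to check the option order before submitting a job; the output can be pasted into a shell once it looks right.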