Command Line Options
Operation Modes
-m MODE, --mode=MODE archive | delete | save
The mode to use depends on your intent. Archive stores the data in the chosen storage medium and then removes it from Solr once it has been stored, Delete removes the data from Solr without saving it, and Save is just like Archive except that the data is not deleted after it has been stored.
Connecting to Solr
-s SOLR_URL, --solr-url=SOLR_URL
The URL to use to connect to the specific Solr Cloud instance.
For example:
‘http://c6401.ambari.apache.org:8886/solr’.
-c COLLECTION, --collection=COLLECTION
The name of the Solr collection. For example: ‘hadoop_logs’
-k SOLR_KEYTAB,--solr-keytab=SOLR_KEYTAB
The keytab file to use when operating against a kerberized Solr instance.
-n SOLR_PRINCIPAL, --solr-principal=SOLR_PRINCIPAL
The principal name to use when operating against a kerberized Solr instance.
Record Schema
-i ID_FIELD, --id-field=ID_FIELD
The name of the field in the Solr schema to use as the unique identifier for each record.
-f FILTER_FIELD, --filter-field=FILTER_FIELD
The name of the field in the Solr schema to filter on. For example: 'logtime'
-o DATE_FORMAT, --date-format=DATE_FORMAT
The custom date format to use with the -d DAYS option to match log entries that are older than a certain number of days.
-e END
Based on the filter field and date format, this argument sets the date to use as the end of the date range. If you use '2018-08-29T12:00:00.000Z', then any records whose filter field value is before that date will be saved, deleted, or archived, depending on the mode.
-d DAYS, --days=DAYS
Based on the filter field and date format, this argument sets how many days before today the end of the range should be. If you use '30', then any records whose filter field value is older than 30 days will be saved, deleted, or archived, depending on the mode.
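For example, a hypothetical run might combine -o and -d to match records whose logtime is older than 30 days; the format string below is an assumption based on the ISO-8601 timestamps shown in the examples later in this document, not a documented default:
infra-solr-data-manager -m delete -s http://c6401.ambari.apache.org:8886/solr -c hadoop_logs -f logtime -o '%Y-%m-%dT%H:%M:%S.%fZ' -d 30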
-q ADDITIONAL_FILTER, --additional-filter=ADDITIONAL_FILTER
Any additional filter criteria to use to match records in the collection.
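For example, to restrict a run to HDFS audit events only (as in the save example later in this document), an additional Solr filter such as the following might be passed:
-q type:\"hdfs_audit\"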
Extracting Records
-r READ_BLOCK_SIZE, --read-block-size=READ_BLOCK_SIZE
The number of records to read at a time from Solr. For example: ‘10’ to read 10 records at a time.
-w WRITE_BLOCK_SIZE, --write-block-size=WRITE_BLOCK_SIZE
The number of records to write per output file. For example: ‘100’ to write 100 records per file.
-j NAME, --name=NAME
Additional name to add to the final filename created in save or archive mode.
--json-file
The default output format is one valid JSON document per record, delimited by newlines. This option instead writes a single valid JSON document containing all of the records.
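As an illustration only (field names are borrowed from the Hive schemas later in this document, values are made up), the default line-delimited output looks roughly like this, one JSON document per line:
{"logtime":"2017-10-28T01:25:40.693Z","level":"INFO","log_message":"..."}
{"logtime":"2017-10-28T01:25:41.001Z","level":"WARN","log_message":"..."}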
-z COMPRESSION, --compression=COMPRESSION none | tar.gz | tar.bz2 | zip | gz
Depending on how the output files will be analyzed, you can choose the compression and file format best suited to that purpose. Gzip compression is used by default.
Writing Data to HDFS
-a HDFS_KEYTAB, --hdfs-keytab=HDFS_KEYTAB
The keytab file to use when writing data to a kerberized HDFS instance.
-l HDFS_PRINCIPAL, --hdfs-principal=HDFS_PRINCIPAL
The principal name to use when writing data to a kerberized HDFS instance.
-u HDFS_USER, --hdfs-user=HDFS_USER
The user to connect to HDFS as.
-p HDFS_PATH, --hdfs-path=HDFS_PATH
The path in HDFS to write data to in save or archive mode.
Writing Data to S3
-t KEY_FILE_PATH, --key-file-path=KEY_FILE_PATH
The path to the file on the local file system that contains the AWS Access and Secret Keys. The file should contain the keys in this format: <accessKey>,<secretKey>
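For example, a key file might contain a single line like the following (these are the placeholder credentials used in the AWS documentation, not real keys):
AKIAIOSFODNN7EXAMPLE,wJalrXbUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY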
-b BUCKET, --bucket=BUCKET
The name of the bucket that data should be uploaded to in save or archive mode.
-y KEY_PREFIX, --key-prefix=KEY_PREFIX
The key prefix allows you to create a logical grouping of the objects in an S3 bucket. The prefix value is similar to a directory name, enabling you to store data in the same logical directory within a bucket. For example, if your Amazon S3 bucket name is logs, you set the prefix to hadoop/, and the file on your storage device is hadoop_logs_-_2017-10-28T01_25_40.693Z.json.gz, then the file would be identified by this URL: http://s3.amazonaws.com/logs/hadoop/hadoop_logs_-_2017-10-28T01_25_40.693Z.json.gz
-g, --ignore-unfinished-uploading
To deal with connectivity issues, uploading the extracted data can be retried. If you do not wish to resume unfinished uploads, use the -g flag to disable this behaviour.
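For example, a hypothetical archive run that uploads to an S3 bucket named logs under the hadoop/ prefix might look like the following; the key file path, bucket, and prefix are placeholders:
infra-solr-data-manager -m archive -s http://c6401.ambari.apache.org:8886/solr -c hadoop_logs -f logtime -d 30 -t /etc/security/aws_keys.txt -b logs -y hadoop/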
Writing Data Locally
-x LOCAL_PATH, --local-path=LOCAL_PATH
The path on the local file system that should be used to write data to in save or archive mode.
Examples
Deleting Indexed Data
In delete mode (-m delete), the program deletes data from the Solr collection. This mode uses the filter field (-f FILTER_FIELD) option to control which data should be removed from the index.
To delete log entries from the hadoop_logs collection that were created before August 29, 2017, we'll use the -f option to specify the field in the Solr collection to use as the filter field, and the -e option to mark the end of the range of values to remove.
infra-solr-data-manager -m delete -s http://c6401.ambari.apache.org:8886/solr -c hadoop_logs -f logtime -e 2017-08-29T12:00:00.000Z
Archiving Indexed Data
In archive mode, the program fetches data from the Solr collection and writes it out to HDFS or S3, then deletes the data.
The program fetches records from Solr and creates a file once the write block size is reached, or once no more matching records are found in Solr. It keeps track of its progress by fetching the records ordered by the filter field and the id field, and by always saving their last values. Once the file is written, it is compressed using the configured compression type.
After the compressed file is created, the program creates a command file containing instructions for the next steps. In case of any interruption or error, the next run for the same collection starts by executing the saved command file, so that the data stays consistent. If the error is due to an invalid configuration and the failures persist, the -g option can be used to ignore the saved command file.
The program supports writing data to HDFS, S3, or Local Disk.
Saving Indexed Data
Saving is similar to archiving, except that the data is not deleted from Solr after the files are created and uploaded. Save mode is recommended for testing that the data is written as expected before running the program in Archive mode with the same parameters.
Examples
- Archive data from the Solr collection hadoop_logs accessible at http://c6401.ambari.apache.org:8886/solr based on the field logtime, save everything older than 1 day, read 10 documents at a time, write 100 documents into a file, and copy the compressed files into the local directory /tmp. Do this in verbose mode:
infra-solr-data-manager -m archive -s http://c6401.ambari.apache.org:8886/solr -c hadoop_logs -f logtime -d 1
-r 10 -w 100 -x /tmp -v
- Save the last 3 days of HDFS audit logs into the HDFS path "/" as the user hdfs, fetching data from a kerberized Solr:
infra-solr-data-manager -m save -s http://c6401.ambari.apache.org:8886/solr -c audit_logs -f logtime -d 3 -r 10 -w 100 -q type:\"hdfs_audit\" -j hdfs_audit -k /etc/security/keytabs/ambari-infra-solr.service.keytab -n infra-solr/c6401.ambari.apache.org@AMBARI.APACHE.ORG -u hdfs -p /
- Delete the data before 2017-08-29T12:00:00.000Z:
infra-solr-data-manager -m delete -s http://c6401.ambari.apache.org:8886/solr -c hadoop_logs -f logtime -e 2017-08-29T12:00:00.000Z
Handling Archived Data
After the data is archived, one way to work with it is to load it into Hive tables. Only line-delimited JSON files can be loaded this way; this is the default output format, so do not use the --json-file option. To be able to load the data, you must either archive it uncompressed (--compression none) or decompress the files before using them. You must also add a JSON SerDe before loading the data into Hive tables, as the JSON data must be deserialized. The following examples use hive-json-serde.jar; before creating the tables, add it in the Hive shell like this:
ADD JAR <path-to-jar>/hive-json-serde.jar;
Here are some suggested table schemas for the various log types. Using external tables is recommended, as it has the advantage of keeping the archives in HDFS. In the examples below, the parts in parentheses show what to add to create external tables; omitting them leads to managed Hive tables. To create external tables, you must first create a directory in HDFS where the data will be kept:
hadoop fs -mkdir <some directory path>
Hadoop logs
CREATE (EXTERNAL) TABLE hadoop_logs
(
  logtime string,
  level string,
  thread_name string,
  logger_name string,
  file string,
  line_number int,
  method string,
  log_message string,
  cluster string,
  type string,
  path string,
  logfile_line_number int,
  host string,
  ip string,
  id string,
  event_md5 string,
  message_md5 string,
  seq_num int
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
(LOCATION '<some directory path>');
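As a sketch (the file name and directory are placeholders), uncompressed archive files can then be copied into the table's HDFS location and queried from the Hive shell:
hadoop fs -put /tmp/hadoop_logs_-_2017-10-28T01_25_40.693Z.json <some directory path>/
SELECT logtime, level, log_message FROM hadoop_logs LIMIT 10;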
Audit logs
As audit logs are slightly different, we suggest archiving them separately using --additional-filter, and we provide separate schemas for HDFS, Ambari, and Ranger audit logs.
- HDFS Audit logs
CREATE (EXTERNAL) TABLE audit_logs_hdfs
(
  evtTime string,
  level string,
  logger_name string,
  log_message string,
  resource string,
  result int,
  action string,
  cliType string,
  req_caller_id string,
  ugi string,
  reqUser string,
  proxyUsers array<string>,
  authType string,
  proxyAuthType string,
  dst string,
  perm string,
  cluster string,
  type string,
  path string,
  logfile_line_number int,
  host string,
  ip string,
  cliIP string,
  id string,
  event_md5 string,
  message_md5 string,
  seq_num int
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
(LOCATION '<some directory path>');
- Ambari Audit logs
CREATE (EXTERNAL) TABLE audit_logs_ambari
(
  evtTime string,
  log_message string,
  resource string,
  result int,
  action string,
  reason string,
  ws_base_url string,
  ws_command string,
  ws_component string,
  ws_details string,
  ws_display_name string,
  ws_os string,
  ws_repo_id string,
  ws_repo_version string,
  ws_repositories string,
  ws_request_id string,
  ws_roles string,
  ws_stack string,
  ws_stack_version string,
  ws_version_note string,
  ws_version_number string,
  ws_status string,
  ws_result_status string,
  cliType string,
  reqUser string,
  task_id int,
  cluster string,
  type string,
  path string,
  logfile_line_number int,
  host string,
  cliIP string,
  id string,
  event_md5 string,
  message_md5 string,
  seq_num int
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
(LOCATION '<some directory path>');
- Ranger Audit logs
CREATE (EXTERNAL) TABLE audit_logs_ranger
(
  evtTime string,
  access string,
  enforcer string,
  resource string,
  result int,
  action string,
  reason string,
  resType string,
  reqUser string,
  cluster string,
  cliIP string,
  id string,
  seq_num int
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
(LOCATION '<some directory path>');
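Once the tables exist, the archived data can be queried like any other Hive table; for instance, an illustrative aggregation over the HDFS audit log table:
SELECT reqUser, action, count(*) AS events
FROM audit_logs_hdfs
GROUP BY reqUser, action
ORDER BY events DESC
LIMIT 20;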