Command Line Options
Operation Modes
-m MODE, --mode=MODE archive | delete | save
The mode to use depends on the intent. Archive stores data in the desired storage medium and then removes it from Solr once it has been stored, Delete removes the data from Solr without storing it, and Save is just like Archive except that the data is not deleted from Solr after it has been stored.
--
Connecting to Solr
-s SOLR_URL, --solr-url=SOLR_URL
The URL to use to connect to the specific SolrCloud instance.
For example:
‘http://c6401.ambari.apache.org:8886/solr’.
-c COLLECTION, --collection=COLLECTION
The name of the Solr collection. For example: ‘hadoop_logs’
-k SOLR_KEYTAB, --solr-keytab=SOLR_KEYTAB
The keytab file to use when operating against a kerberized Solr instance.
-n SOLR_PRINCIPAL, --solr-principal=SOLR_PRINCIPAL
The principal name to use when operating against a kerberized Solr instance.
--
Record Schema
-i ID_FIELD, --id-field=ID_FIELD
The name of the field in the Solr schema to use as the unique identifier for each record.
-f FILTER_FIELD, --filter-field=FILTER_FIELD
The name of the field in the Solr schema to filter on. For example: ‘logtime’
-o DATE_FORMAT, --date-format=DATE_FORMAT
The custom date format to use with the -d DAYS option to match log entries that are older than a certain number of days.
-e END
Based on the filter field and date format, this argument configures the date that should be used as the end of the date range. If you use ‘2018-08-29T12:00:00.000Z’, then any records with a filter field value before that date will be saved, deleted, or archived depending on the mode.
-d DAYS, --days=DAYS
Based on the filter field and date format, this argument configures the number of days before today that should be used as the end of the date range. If you use ‘30’, then any records with a filter field value older than 30 days will be saved, deleted, or archived depending on the mode. A combined example of the filtering options is shown after this list.
-q ADDITIONAL_FILTER, --additional-filter=ADDITIONAL_FILTER
Any additional filter criteria to use to match records in the collection.
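Combining the options above, here is a minimal sketch that deletes error-level entries older than 30 days. The id field name, the level:ERROR filter, and the strftime-style date format pattern are illustrative assumptions for the example, not documented defaults:
infra-solr-data-manager -m delete -s http://c6401.ambari.apache.org:8886/solr -c hadoop_logs -i id -f logtime -o "%Y-%m-%dT%H:%M:%S.%fZ" -d 30 -q level:ERROR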
--
Extracting Records
-r READ_BLOCK_SIZE, --read-block-size=READ_BLOCK_SIZE
The number of records to read at a time from Solr. For example: ‘10’ to read 10 records at a time.
-w WRITE_BLOCK_SIZE, --write-block-size=WRITE_BLOCK_SIZE
The number of records to write per output file. For example: ‘100’ to write 100 records per file.
-j NAME, --name=NAME
Additional name to add to the final filename created in save or archive mode.
--json-file
By default, the output format is one valid JSON document per record, delimited by newlines. This option writes a single valid JSON document containing all of the records instead.
-z COMPRESSION, --compression=COMPRESSION none | tar.gz | tar.bz2 | zip | gz
Depending on how the output files will be analyzed, you can choose the compression and file format best suited to them. Gzip compression is used by default. An example combining the extraction options follows this list.
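As an illustrative sketch of the extraction options (the block sizes, name suffix, and zip choice are arbitrary values picked for the example), the following reads 1000 records per Solr request, writes 10000 records per output file, tags the files with ‘cluster1’, and packages them as zip archives in /tmp:
infra-solr-data-manager -m save -s http://c6401.ambari.apache.org:8886/solr -c hadoop_logs -f logtime -d 7 -r 1000 -w 10000 -j cluster1 -z zip -x /tmp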
--
Writing Data to HDFS
-a HDFS_KEYTAB, --hdfs-keytab=HDFS_KEYTAB
The keytab file to use when writing data to a kerberized HDFS instance.
-l HDFS_PRINCIPAL, --hdfs-principal=HDFS_PRINCIPAL
The principal name to use when writing data to a kerberized HDFS instance.
-u HDFS_USER, --hdfs-user=HDFS_USER
The user to connect to HDFS as.
-p HDFS_PATH, --hdfs-path=HDFS_PATH
The path in HDFS to write data to in save or archive mode. A sketch of writing to HDFS follows this list.
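A minimal sketch of writing archives to a kerberized HDFS instance; the keytab path, principal, and HDFS path are illustrative, and the exact combination of options depends on your cluster's security setup:
infra-solr-data-manager -m archive -s http://c6401.ambari.apache.org:8886/solr -c hadoop_logs -f logtime -d 30 -a /etc/security/keytabs/hdfs.headless.keytab -l hdfs@EXAMPLE.COM -u hdfs -p /archive/hadoop_logs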
--
Writing Data to S3
-t KEY_FILE_PATH, --key-file-path=KEY_FILE_PATH
The path to the file on the local file system that contains the AWS Access and Secret Keys. The file should contain the keys in this format: <accessKey>,<secretKey>
-b BUCKET, --bucket=BUCKET
The name of the bucket that data should be uploaded to in save or archive mode.
-y KEY_PREFIX, --key-prefix=KEY_PREFIX
The key prefix allows you to create a logical grouping of the objects in an S3 bucket. The prefix value is similar to a directory name, enabling you to store related data under a common path in the bucket. For example, if your Amazon S3 bucket name is logs, you set the prefix to hadoop/, and the file on your storage device is hadoop_logs_-_2017-10-28T01_25_40.693Z.json.gz, then the file would be identified by this URL: http://s3.amazonaws.com/logs/hadoop/hadoop_logs_-_2017-10-28T01_25_40.693Z.json.gz
-g, --ignore-unfinished-uploading
To deal with connectivity issues, uploading extracted data can be retried. If you do not wish to resume unfinished uploads, use the -g flag to disable this behaviour. An S3 upload sketch follows this list.
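A sketch of uploading archives to S3; the key file location, bucket name, and prefix are illustrative. The key file simply contains the access and secret keys separated by a comma, as described above:
echo "<accessKey>,<secretKey>" > /root/aws_keys
chmod 600 /root/aws_keys
infra-solr-data-manager -m archive -s http://c6401.ambari.apache.org:8886/solr -c hadoop_logs -f logtime -d 30 -t /root/aws_keys -b logs -y hadoop/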
--
Writing Data Locally
-x LOCAL_PATH, --local-path=LOCAL_PATH
The path on the local file system that should be used to write data to in save or archive mode.
--
Examples
Deleting Indexed Data
In delete mode (-m delete), the program deletes data from the Solr collection. This mode uses the filter field (-f FILTER_FIELD) option to control which data should be removed from the index.
The command below deletes log entries from the hadoop_logs collection that were created before August 29, 2017. The -f option specifies the field in the Solr collection to use as the filter field, and the -e option denotes the end of the range of values to remove.
infra-solr-data-manager -m delete -s http://c6401.ambari.apache.org:8886/solr -c hadoop_logs -f logtime -e 2017-08-29T12:00:00.000Z
Archiving Indexed Data
In archive mode, the program fetches data from the Solr collection and writes it out to HDFS or S3, then deletes the data.
The program fetches records from Solr and creates a file once the write block size is reached, or once no more matching records are found in Solr. It keeps track of its progress by fetching the records ordered by the filter field and the id field, and always saves their last values. Once a file is written, it is compressed using the configured compression type.
After the compressed file is created, the program writes a command file containing instructions for the next steps. If an interruption or error occurs, the next run for the same collection starts by executing the saved command file, so the data remains consistent. If the error is caused by invalid configuration and the failures persist, the -g option can be used to ignore the saved command file. The program supports writing data to HDFS, S3, or local disk.
The command below archives data from the Solr collection hadoop_logs accessible at http://c6401.ambari.apache.org:8886/solr based on the field logtime. It extracts everything older than 1 day, reads 10 documents at once, writes 100 documents into each file, and copies the compressed files into the local directory /tmp.
infra-solr-data-manager -m archive -s http://c6401.ambari.apache.org:8886/solr -c hadoop_logs -f logtime -d 1 -r 10 -w 100 -x /tmp -v
Saving Indexed Data
Saving is similar to archiving, except that the data is not deleted from Solr after the files are created and uploaded. Save mode is recommended for verifying that the data is written as expected before running the program in archive mode with the same parameters.
The example below saves the last 3 days of HDFS audit logs to the HDFS path "/" as the hdfs user, fetching data from a kerberized Solr instance.
infra-solr-data-manager -m save -s http://c6401.ambari.apache.org:8886/solr -c audit_logs -f logtime -d 3 -r 10 -w 100 -q type:\"hdfs_audit\" -j hdfs_audit -k /etc/security/keytabs/ambari-infra-solr.service.keytab -n infra-solr/c6401.ambari.apache.org@AMBARI.APACHE.ORG -u hdfs -p /
Analyzing Archived Data With Hive
Once data has been archived or saved to HDFS, Hive tables can be used to quickly access and analyze the stored data. Only line-delimited JSON files can be analyzed with Hive. Line-delimited JSON files are created by default unless the --json-file argument is passed; data saved or archived using --json-file cannot be analyzed with Hive. In the following examples, the hive-json-serde.jar is used to process the stored JSON data. Prior to creating the tables below, the jar must be added in the Hive shell:
ADD JAR <path-to-jar>/hive-json-serde.jar;
Here are some example table schemas for various log types. Using external tables is recommended, as it has the advantage of keeping the archives in HDFS. First, ensure a directory exists to store the archived or saved line-delimited logs:
hadoop fs -mkdir <some directory path>
Hadoop Logs
CREATE EXTERNAL TABLE hadoop_logs ( logtime string, level string, thread_name string, logger_name string, file string, line_number int, method string, log_message string, cluster string, type string, path string, logfile_line_number int, host string, ip string, id string, event_md5 string, message_md5 string, seq_num int ) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' LOCATION '<some directory path>';
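With the table in place, standard HiveQL can be run against the archived logs. For example, a rough sketch that counts entries per log level for one host (the host value is illustrative):
SELECT level, count(*) AS entries FROM hadoop_logs WHERE host = 'c6401.ambari.apache.org' GROUP BY level ORDER BY entries DESC;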
Audit Logs
As audit logs have a slightly different field set, we suggest archiving them separately using --additional-filter, and we offer separate schemas for HDFS, Ambari, and Ranger audit logs.
HDFS Audit Logs
CREATE EXTERNAL TABLE audit_logs_hdfs ( evtTime string, level string, logger_name string, log_message string, resource string, result int, action string, cliType string, req_caller_id string, ugi string, reqUser string, proxyUsers array<string>, authType string, proxyAuthType string, dst string, perm string, cluster string, type string, path string, logfile_line_number int, host string, ip string, cliIP string, id string, event_md5 string, message_md5 string, seq_num int ) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' LOCATION '<some directory path>';
Ambari Audit Logs
CREATE EXTERNAL TABLE audit_logs_ambari ( evtTime string, log_message string, resource string, result int, action string, reason string, ws_base_url string, ws_command string, ws_component string, ws_details string, ws_display_name string, ws_os string, ws_repo_id string, ws_repo_version string, ws_repositories string, ws_request_id string, ws_roles string, ws_stack string, ws_stack_version string, ws_version_note string, ws_version_number string, ws_status string, ws_result_status string, cliType string, reqUser string, task_id int, cluster string, type string, path string, logfile_line_number int, host string, cliIP string, id string, event_md5 string, message_md5 string, seq_num int ) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' LOCATION '<some directory path>';
Ranger Audit Logs
CREATE EXTERNAL TABLE audit_logs_ranger ( evtTime string, access string, enforcer string, resource string, result int, action string, reason string, resType string, reqUser string, cluster string, cliIP string, id string, seq_num int ) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' LOCATION '<some directory path>';
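Similarly, a sketch of a query against the Ranger audit table that counts events per requesting user and access type:
SELECT reqUser, access, count(*) AS events FROM audit_logs_ranger GROUP BY reqUser, access ORDER BY events DESC LIMIT 20;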