Command Line Options
Operation Modes
-m MODE, --mode=MODE archive | delete | save
The mode to use depends on your intent. Archive stores the data in the chosen storage medium and then removes it from Solr once it has been stored, Delete removes the data from Solr without saving it, and Save is just like Archive except that the data is not deleted after it has been stored.
Connecting to Solr
-s SOLR_URL, --solr-url=SOLR_URL
The URL to use to connect to the specific Solr Cloud instance.
For example:
‘http://c6401.ambari.apache.org:8886/solr’.
-c COLLECTION, --collection=COLLECTION
The name of the Solr collection. For example: ‘hadoop_logs’
-k SOLR_KEYTAB,--solr-keytab=SOLR_KEYTAB
The keytab file to use when operating against a kerberized Solr instance.
-n SOLR_PRINCIPAL, --solr-principal=SOLR_PRINCIPAL
The principal name to use when operating against a kerberized Solr instance.
Record Schema
-i ID_FIELD, --id-field=ID_FIELD
The name of the field in the Solr schema to use as the unique identifier for each record.
-f FILTER_FIELD, --filter-field=FILTER_FIELD
The name of the field in the Solr schema to filter on. For example: 'logtime'
-o DATE_FORMAT, --date-format=DATE_FORMAT
The custom date format to use with the -d DAYS option to match log entries that are older than a certain number of days.
-e END
Based on the filter field and date format, this argument sets the date to use as the end of the date range. If you use '2018-08-29T12:00:00.000Z', then any records whose filter field value is before that date will be saved, deleted, or archived, depending on the mode.
-d DAYS, --days=DAYS
Based on the filter field and date format, this argument sets how many days before today the end of the range should be. If you use '30', then any records whose filter field value is older than 30 days will be saved, deleted, or archived, depending on the mode.
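For example, a hypothetical run might combine -o and -d to match records whose logtime is older than 30 days; the format string below is an assumption based on the ISO-8601 timestamps shown in the examples later in this document, not a documented default:
infra-solr-data-manager -m delete -s http://c6401.ambari.apache.org:8886/solr -c hadoop_logs -f logtime -o '%Y-%m-%dT%H:%M:%S.%fZ' -d 30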
-q ADDITIONAL_FILTER, --additional-filter=ADDITIONAL_FILTER
Any additional filter criteria to use to match records in the collection.
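For example, to restrict a run to HDFS audit events only (as in the save example later in this document), an additional Solr filter such as the following might be passed:
-q type:\"hdfs_audit\"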
Extracting Records
-r READ_BLOCK_SIZE, --read-block-size=READ_BLOCK_SIZE
The number of records to read at a time from Solr. For example: ‘10’ to read 10 records at a time.
-w WRITE_BLOCK_SIZE, --write-block-size=WRITE_BLOCK_SIZE
The number of records to write per output file. For example: ‘100’ to write 100 records per file.
-j NAME, --name=NAME
Additional name to add to the final filename created in save or archive mode.
--json-file
The default output format is one valid JSON document per record, delimited by newlines. This option instead writes a single valid JSON document containing all of the records.
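As an illustration only (field names are borrowed from the Hive schemas later in this document, values are made up), the default line-delimited output looks roughly like this, one JSON document per line:
{"logtime":"2017-10-28T01:25:40.693Z","level":"INFO","log_message":"..."}
{"logtime":"2017-10-28T01:25:41.001Z","level":"WARN","log_message":"..."}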
-z COMPRESSION, --compression=COMPRESSION none | tar.gz | tar.bz2 | zip | gz
Depending on how the output files will be analyzed, you can choose the compression and file format best suited to that purpose. Gzip compression is used by default.
Writing Data to HDFS
-a HDFS_KEYTAB, --hdfs-keytab=HDFS_KEYTAB
The keytab file to use when writing data to a kerberized HDFS instance.
-l HDFS_PRINCIPAL, --hdfs-principal=HDFS_PRINCIPAL
The principal name to use when writing data to a kerberized HDFS instance.
-u HDFS_USER, --hdfs-user=HDFS_USER
The user to connect to HDFS as.
-p HDFS_PATH, --hdfs-path=HDFS_PATH
The path in HDFS to write data to in save or archive mode.
Writing Data to S3
-t KEY_FILE_PATH, --key-file-path=KEY_FILE_PATH
The path to the file on the local file system that contains the AWS Access and Secret Keys. The file should contain the keys in this format: <accessKey>,<secretKey>
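For example, a key file might contain a single line like the following (these are the placeholder credentials used in the AWS documentation, not real keys):
AKIAIOSFODNN7EXAMPLE,wJalrXbUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY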
-b BUCKET, --bucket=BUCKET
The name of the bucket that data should be uploaded to in save or archive mode.
-y KEY_PREFIX, --key-prefix=KEY_PREFIX
The key prefix allows you to create a logical grouping of the objects in an S3 bucket. The prefix value is similar to a directory name, enabling you to store data in the same logical directory within a bucket. For example, if your Amazon S3 bucket name is logs, you set the prefix to hadoop/, and the file on your storage device is hadoop_logs_-_2017-10-28T01_25_40.693Z.json.gz, then the file would be identified by this URL: http://s3.amazonaws.com/logs/hadoop/hadoop_logs_-_2017-10-28T01_25_40.693Z.json.gz
-g, --ignore-unfinished-uploading
To deal with connectivity issues, uploading the extracted data can be retried. If you do not wish to resume unfinished uploads, use the -g flag to disable this behaviour.
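For example, a hypothetical archive run that uploads to an S3 bucket named logs under the hadoop/ prefix might look like the following; the key file path, bucket, and prefix are placeholders:
infra-solr-data-manager -m archive -s http://c6401.ambari.apache.org:8886/solr -c hadoop_logs -f logtime -d 30 -t /etc/security/aws_keys.txt -b logs -y hadoop/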
Writing Data Locally
-x LOCAL_PATH, --local-path=LOCAL_PATH
The path on the local file system that should be used to write data to in save or archive mode.
Examples
Deleting Indexed Data
In delete mode (-m delete), the program deletes data from the Solr collection. This mode uses the filter field (-f FILTER_FIELD) option to control which data should be removed from the index.
To delete log entries from the hadoop_logs collection that were created before August 29, 2017, we'll use the -f option to specify the field in the Solr collection to use as the filter field, and the -e option to mark the end of the range of values to remove.
infra-solr-data-manager -m delete -s http://c6401.ambari.apache.org:8886/solr -c hadoop_logs -f logtime -e 2017-08-29T12:00:00.000Z
Archiving Indexed Data
In archive mode, the program fetches data from the Solr collection and writes it out to HDFS or S3, then deletes the data.
The program fetches records from Solr and creates a file once the write block size is reached, or once no more matching records are found in Solr. It keeps track of its progress by fetching the records ordered by the filter field and the id field, and by always saving their last values. Once the file is written, it is compressed using the configured compression type.
After the compressed file is created, the program creates a command file containing instructions for the next steps. In case of any interruption or error, the next run for the same collection starts by executing the saved command file, so that the data stays consistent. If the error is due to an invalid configuration and the failures persist, the -g option can be used to ignore the saved command file.
The program supports writing data to HDFS, S3, or Local Disk.
Saving Indexed Data
Saving is similar to archiving, except that the data is not deleted from Solr after the files are created and uploaded. Save mode is recommended for testing that the data is written as expected before running the program in Archive mode with the same parameters.
Examples
- Archive data from the Solr collection hadoop_logs accessible at http://c6401.ambari.apache.org:8886/solr based on the field logtime, save everything older than 1 day, read 10 documents at a time, write 100 documents into a file, and copy the compressed files into the local directory /tmp. Do this in verbose mode:
infra-solr-data-manager -m archive -s http://c6401.ambari.apache.org:8886/solr -c hadoop_logs -f logtime -d 1
-r 10 -w 100 -x /tmp -v
- Save the last 3 days of HDFS audit logs into the HDFS path "/" as the user hdfs, fetching data from a kerberized Solr:
infra-solr-data-manager -m save -s http://c6401.ambari.apache.org:8886/solr -c audit_logs -f logtime -d 3 -r 10 -w 100 -q type:\"hdfs_audit\" -j hdfs_audit -k /etc/security/keytabs/ambari-infra-solr.service.keytab -n infra-solr/c6401.ambari.apache.org@AMBARI.APACHE.ORG -u hdfs -p /
- Delete the data before 2017-08-29T12:00:00.000Z:
infra-solr-data-manager -m delete -s http://c6401.ambari.apache.org:8886/solr -c hadoop_logs -f logtime -e 2017-08-29T12:00:00.000Z
Handling Archived Data
After the data is archived, one way to work with it is to load it into Hive tables. Only line-delimited JSON files can be loaded this way; this is the default output format, so do not use the --json-file option. To be able to load the data, you must either archive it uncompressed (--compression none) or decompress the files before using them. You must also add a JSON SerDe before loading the data into Hive tables, as the JSON data must be deserialized. The following examples use hive-json-serde.jar; before creating the tables, add it in the Hive shell like this:
ADD JAR <path-to-jar>/hive-json-serde.jar;
Here are some suggested table schemas for the various log types. Using external tables is recommended, as it has the advantage of keeping the archives in HDFS. In the examples below, the parts in parentheses show what to add to create external tables; omitting them leads to managed Hive tables. To create external tables, you must first create a directory in HDFS where the data will be kept:
hadoop fs -mkdir <some directory path>
Hadoop logs
CREATE (EXTERNAL) TABLE hadoop_logs
(
  logtime string,
  level string,
  thread_name string,
  logger_name string,
  file string,
  line_number int,
  method string,
  log_message string,
  cluster string,
  type string,
  path string,
  logfile_line_number int,
  host string,
  ip string,
  id string,
  event_md5 string,
  message_md5 string,
  seq_num int
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
(LOCATION '<some directory path>');
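As a sketch (the file name and directory are placeholders), uncompressed archive files can then be copied into the table's HDFS location and queried from the Hive shell:
hadoop fs -put /tmp/hadoop_logs_-_2017-10-28T01_25_40.693Z.json <some directory path>/
SELECT logtime, level, log_message FROM hadoop_logs LIMIT 10;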
Audit logs
As audit logs are slightly different, we suggest archiving them separately using --additional-filter, and we provide separate schemas for HDFS, Ambari, and Ranger audit logs.
- HDFS Audit logs
CREATE (EXTERNAL) TABLE audit_logs_hdfs
(
  evtTime string,
  level string,
  logger_name string,
  log_message string,
  resource string,
  result int,
  action string,
  cliType string,
  req_caller_id string,
  ugi string,
  reqUser string,
  proxyUsers array<string>,
  authType string,
  proxyAuthType string,
  dst string,
  perm string,
  cluster string,
  type string,
  path string,
  logfile_line_number int,
  host string,
  ip string,
  cliIP string,
  id string,
  event_md5 string,
  message_md5 string,
  seq_num int
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
(LOCATION '<some directory path>');
- Ambari Audit logs
CREATE (EXTERNAL) TABLE audit_logs_ambari
(
  evtTime string,
  log_message string,
  resource string,
  result int,
  action string,
  reason string,
  ws_base_url string,
  ws_command string,
  ws_component string,
  ws_details string,
  ws_display_name string,
  ws_os string,
  ws_repo_id string,
  ws_repo_version string,
  ws_repositories string,
  ws_request_id string,
  ws_roles string,
  ws_stack string,
  ws_stack_version string,
  ws_version_note string,
  ws_version_number string,
  ws_status string,
  ws_result_status string,
  cliType string,
  reqUser string,
  task_id int,
  cluster string,
  type string,
  path string,
  logfile_line_number int,
  host string,
  cliIP string,
  id string,
  event_md5 string,
  message_md5 string,
  seq_num int
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
(LOCATION '<some directory path>');
- Ranger Audit logs
CREATE (EXTERNAL) TABLE audit_logs_ranger
(
  evtTime string,
  access string,
  enforcer string,
  resource string,
  result int,
  action string,
  reason string,
  resType string,
  reqUser string,
  cluster string,
  cliIP string,
  id string,
  seq_num int
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
(LOCATION '<some directory path>');
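Once the tables exist, the archived data can be queried like any other Hive table; for instance, an illustrative aggregation over the HDFS audit log table:
SELECT reqUser, action, count(*) AS events
FROM audit_logs_hdfs
GROUP BY reqUser, action
ORDER BY events DESC
LIMIT 20;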