Configure an Extractor Configuration File
You use the extractor configuration file to bulk load the enrichment store into HBase.
- On the host on which Metron is installed, log in as root.
- Determine the schema of the enrichment source.
- Create an extractor configuration file called extractor_config_temp.json and populate it with the enrichment source schema:
    {
      "config" : {
        "columns" : {
          "domain" : 0,
          "owner" : 1,
          "home_country" : 2,
          "registrar" : 3,
          "domain_created_timestamp" : 4
        },
        "indicator_column" : "domain",
        "type" : "whois",
        "separator" : ","
      },
      "extractor" : "CSV"
    }
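To make the role of the columns, indicator_column, type, and separator settings concrete, the following Python sketch shows one way a CSV line could be mapped to an indicator and a value using this configuration. It only illustrates the mapping and is not the loader's actual implementation. The extract_record function name and the sample row are hypothetical.

```python
import csv
import io
import json

# The extractor configuration from the step above.
EXTRACTOR_CONFIG = json.loads("""
{
  "config": {
    "columns": {
      "domain": 0,
      "owner": 1,
      "home_country": 2,
      "registrar": 3,
      "domain_created_timestamp": 4
    },
    "indicator_column": "domain",
    "type": "whois",
    "separator": ","
  },
  "extractor": "CSV"
}
""")


def extract_record(line, extractor_config):
    """Map one CSV line to (indicator, type, value). Hypothetical helper."""
    cfg = extractor_config["config"]
    # Split the line on the configured separator.
    fields = next(csv.reader(io.StringIO(line), delimiter=cfg["separator"]))
    # Name each field by its position in the "columns" mapping.
    value = {name: fields[index] for name, index in cfg["columns"].items()}
    # The indicator is the value of the column named by "indicator_column".
    return value[cfg["indicator_column"]], cfg["type"], value


# Hypothetical sample row, not taken from a real enrichment source.
SAMPLE_LINE = "example.com,Example Owner,US,Example Registrar,1000000000"

if __name__ == "__main__":
    print(extract_record(SAMPLE_LINE, EXTRACTOR_CONFIG))
```

Checking a few sample rows this way before a bulk load can reveal a column index that is off by one.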
- Transform and filter the enrichment data as it is loaded into HBase by using Stellar extractor properties in the extractor configuration file.
HCP supports the following Stellar extractor properties:
- value_transform
- Transforms fields defined in the columns mapping with Stellar transformations. New keys introduced in the transform are added to the key metadata:
    "value_transform" : {
      "domain" : "DOMAIN_REMOVE_TLD(domain)"
    },
- value_filter
- Allows additional filtering with Stellar predicates based on results from the value transformations. In the following example, records whose domain property is empty after removing the TLD are omitted:
    "value_filter" : "LENGTH(domain) > 0",
    "indicator_column" : "domain",
- indicator_transform
- Transforms the indicator column independent of the value transformations. You can refer to the original indicator value by using indicator as the variable name, as shown in the following example:
    "indicator_transform" : {
      "indicator" : "DOMAIN_REMOVE_TLD(indicator)"
    },
  In addition, if you prefer to piggyback your transformations, you can refer to the variable domain, which allows your indicator transforms to inherit transformations to this value.
- indicator_filter
- Allows additional filtering with Stellar predicates based on results from the value transformations. In the following example, records with empty indicator values after removing the TLD are omitted:
    "indicator_filter" : "LENGTH(indicator) > 0",
    "type" : "top_domains",
- state_init
- Allows a state object to be initialized. This is a string, so a single expression is created. The output of this expression is available as the state variable. Use this property with flatfile_summarizer.sh rather than the loader.
- state_update
- Allows a state object to be updated. This is a map, so you can have temporary variables here. Note that you can reference the state variable from this map. Use this property with flatfile_summarizer.sh rather than the loader.
- state_merge
- Allows a list of states to be merged. This is a string, so a single expression is created. A special field called states is available, which is a list of the states (one per thread). Use this property with flatfile_summarizer.sh rather than the loader.
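The three state properties describe an initialize, update, and merge pattern: each thread starts from the state_init result, folds its records into that state with state_update, and state_merge combines the per-thread states into one summary. The following Python sketch illustrates that pattern only. It does not use Stellar and is not how flatfile_summarizer.sh is implemented. The function names, the use of a set as the state object, and the round-robin split across threads are all assumptions made for illustration.

```python
from functools import reduce


def state_init():
    # Plays the role of the state_init expression: produce the initial state.
    return set()


def state_update(state, record):
    # Plays the role of state_update: return the new state for one record.
    return state | {record["domain"]}


def state_merge(states):
    # Plays the role of state_merge: "states" holds one state per thread.
    return reduce(lambda left, right: left | right, states, set())


def summarize(records, num_threads):
    """Emulate per-thread summarization sequentially (illustration only)."""
    # Assumption: records are split round-robin, one partition per thread.
    partitions = [records[i::num_threads] for i in range(num_threads)]
    states = []
    for partition in partitions:
        state = state_init()
        for record in partition:
            state = state_update(state, record)
        states.append(state)
    return state_merge(states)


if __name__ == "__main__":
    sample = [{"domain": "google"}, {"domain": "yahoo"}, {"domain": "google"}]
    print(sorted(summarize(sample, num_threads=2)))
```

Because the merged result must not depend on how records were split across threads, the update and merge operations should be chosen so that the order of merging does not matter, as with the set union used here.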
An extractor configuration file that includes the Stellar transform and filter properties looks similar to the following:
    {
      "config" : {
        "zk_quorum" : "$ZOOKEEPER_HOST:2181",
        "columns" : {
          "rank" : 0,
          "domain" : 1
        },
        "value_transform" : {
          "domain" : "DOMAIN_REMOVE_TLD(domain)"
        },
        "value_filter" : "LENGTH(domain) > 0",
        "indicator_column" : "domain",
        "indicator_transform" : {
          "indicator" : "DOMAIN_REMOVE_TLD(indicator)"
        },
        "indicator_filter" : "LENGTH(indicator) > 0",
        "type" : "top_domains",
        "separator" : ","
      },
      "extractor" : "CSV"
    }
If you run a file import with this data and extractor configuration, you get the following two extracted data records:
    Indicator  Type         Value
    google     top_domains  { "rank" : "1", "domain" : "google" }
    yahoo      top_domains  { "rank" : "2", "domain" : "yahoo" }
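The order in which the transform and filter properties act on a record can be sketched in Python as follows. This is an illustration under stated assumptions, not the loader's implementation: remove_tld is a naive stand-in for the Stellar DOMAIN_REMOVE_TLD function, the input rows are assumed (the source data file is not shown here), and the third row with an empty domain is added only to show a record being filtered out.

```python
# Column positions, as in the "columns" mapping of the configuration above.
COLUMNS = {"rank": 0, "domain": 1}
ENRICHMENT_TYPE = "top_domains"


def remove_tld(domain):
    # Naive stand-in for DOMAIN_REMOVE_TLD: drop everything after the last
    # dot. The real Stellar function may treat multi-part TLDs differently.
    return domain.rsplit(".", 1)[0]


def load(lines, separator=","):
    """Apply value and indicator transforms and filters to each CSV line."""
    records = []
    for line in lines:
        fields = line.split(separator)
        value = {name: fields[index] for name, index in COLUMNS.items()}
        indicator = value["domain"]  # original indicator_column value

        # value_transform: "domain" : "DOMAIN_REMOVE_TLD(domain)"
        value["domain"] = remove_tld(value["domain"])
        # value_filter: "LENGTH(domain) > 0"
        if not len(value["domain"]) > 0:
            continue

        # indicator_transform: "indicator" : "DOMAIN_REMOVE_TLD(indicator)"
        indicator = remove_tld(indicator)
        # indicator_filter: "LENGTH(indicator) > 0"
        if not len(indicator) > 0:
            continue

        records.append((indicator, ENRICHMENT_TYPE, value))
    return records


# Assumed input rows; the last one has an empty domain and is dropped.
SAMPLE_LINES = ["1,google.com", "2,yahoo.com", "3,"]

if __name__ == "__main__":
    for record in load(SAMPLE_LINES):
        print(record)
```

With these assumed rows, the sketch yields the same two records as the table above: the google and yahoo indicators, each with the TLD removed from the domain value.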
- To access properties that reside in the global configuration file, provide a ZooKeeper quorum by using the zk_quorum property. If the global configuration looks like "global_property" : "metron-ftw", enter the following to expand the value_transform:
    "value_transform" : {
      "domain" : "DOMAIN_REMOVE_TLD(domain)",
      "a-new-prop" : "global_property"
    },
  The resulting value data looks like the following:
    Indicator  Type         Value
    google     top_domains  { "rank" : "1", "domain" : "google", "a-new-prop" : "metron-ftw" }
    yahoo      top_domains  { "rank" : "2", "domain" : "yahoo", "a-new-prop" : "metron-ftw" }
- Remove any non-ASCII invisible characters that were included when you cut and pasted the value_transform information:
    iconv -c -f utf-8 -t ascii extractor_config_temp.json -o extractor_config.json
The extractor_config.json file is not stored by the loader. If you want to reuse it, you must store it yourself.
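If iconv is not available, or if you also want to confirm that the cleaned file is still valid JSON, a short Python script can perform a similar cleanup. The following is a minimal sketch, not part of HCP. The strip_non_ascii and clean_config function names are hypothetical. It approximates the effect of iconv -c by silently dropping any character that cannot be represented in ASCII.

```python
import json


def strip_non_ascii(raw: bytes) -> bytes:
    """Drop every character that is not plain ASCII, similar to iconv -c."""
    # Undecodable bytes and non-ASCII characters are both discarded.
    text = raw.decode("utf-8", errors="ignore")
    return text.encode("ascii", errors="ignore")


def clean_config(src_path: str, dst_path: str) -> dict:
    """Write an ASCII-only copy of src_path to dst_path and return the JSON."""
    with open(src_path, "rb") as src:
        cleaned = strip_non_ascii(src.read())
    # Parse before writing; json.loads raises ValueError on invalid JSON.
    config = json.loads(cleaned)
    with open(dst_path, "wb") as dst:
        dst.write(cleaned)
    return config


if __name__ == "__main__":
    # A zero-width space and a non-breaking space, as often left by a paste.
    pasted = '{"type"\u200b: "top_domains\u00a0"}'.encode("utf-8")
    print(strip_non_ascii(pasted).decode("ascii"))
```

To process the file from this procedure, call clean_config("extractor_config_temp.json", "extractor_config.json"). Parsing before writing means a malformed file is reported instead of being passed on to the loader.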