Pattern-based anonymization rules
Pattern-based rules anonymize any data that matches one of the rule's patterns. An optional extract pattern narrows each match down to the part that should be anonymized.
Required and Optional Fields
- name
- description (optional)
- rule_id (should be set to Pattern)
- patterns
- extract (optional)
- include_files (optional)
- exclude_files (optional)
- action (optional, default value is ANONYMIZE)
- replace_value (optional, applicable only when action=REPLACE)
- shared (optional, default value is true)
- enabled (optional, default value is true)
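The defaults in the field list can be made explicit with a small helper. The following is a minimal sketch in Python, not part of the product: the names `DEFAULTS`, `REQUIRED`, and `with_defaults` are hypothetical, and treating every field not marked optional as required is an assumption drawn from the list above.

```python
import json

# Default values as stated in the field list; the helper itself is illustrative.
DEFAULTS = {"action": "ANONYMIZE", "shared": True, "enabled": True}
# Assumption: the fields not marked "optional" in the list are required.
REQUIRED = ("name", "rule_id", "patterns")


def with_defaults(rule_json):
    """Parse a rule definition and fill in the documented default values."""
    rule = json.loads(rule_json)
    missing = [field for field in REQUIRED if field not in rule]
    if missing:
        raise ValueError("missing required field(s): " + ", ".join(missing))
    # Values given in the rule override the defaults.
    return {**DEFAULTS, **rule}


rule = with_defaults(
    '{"name": "EMAIL", "rule_id": "Pattern", "patterns": ["[a-z]+@[a-z]+"], "shared": false}'
)
print(rule["action"], rule["shared"], rule["enabled"])
```

In this sketch a rule that sets only name, rule_id, patterns, and shared ends up with action ANONYMIZE and enabled true, while the explicit shared value is kept.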
Rule Definition Example (without extract)
{
  "name": "EMAIL",
  "rule_id": "Pattern",
  "patterns": ["(?<![a-z0-9._%+-])[a-z0-9._%+-]+@[a-z0-9.-]+\\.[a-z]{2,6}(?![a-z0-9._%+-])$?"],
  "shared": false
}
Sample input data is:
Hadoop 2.7.3.2.5.0.0-1245
Subversion git@github.com:hortonworks/hadoop.git -r cb6e514b14fb60e9995e5ad9543315cd404b4e59
Compiled by jenkins on 2016-08-26T00:55Z
Output data, with the email address anonymized, is:
Hadoop 2.7.3.2.5.0.0-1245
Subversion ‡qpe@unqfay.mjp‡:hortonworks/hadoop.git -r cb6e514b14fb60e9995e5ad9543315cd404b4e59
Compiled by jenkins on 2016-08-26T00:55Z
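To see which part of the input the EMAIL pattern selects, the regular expression can be tried outside the tool. This is only a sketch using Python's `re` module, which may differ in detail from the regex engine the product uses. The pattern is written without JSON escaping (`\\.` becomes `\.`), and the optional end anchor is left out.

```python
import re

# EMAIL pattern from the rule, unescaped from JSON. The lookbehind and
# lookahead keep the match from starting or ending inside a longer token.
EMAIL = re.compile(
    r"(?<![a-z0-9._%+-])[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,6}(?![a-z0-9._%+-])"
)

line = "Subversion git@github.com:hortonworks/hadoop.git -r cb6e514b14fb60e9995e5ad9543315cd404b4e59"

# The rule has no extract pattern, so the whole match is what gets anonymized.
matches = EMAIL.findall(line)
print(matches)  # ['git@github.com']
```

Only git@github.com is selected. The rest of the line, including the repository path after the colon, is left unchanged.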
Rule Definition Example (with extract)
{
  "name": "KEYSTORE",
  "rule_id": "Pattern",
  "patterns": ["oozie.https.keystore.pass=([^\\s]*)", "OOZIE_HTTPS_KEYSTORE_PASS=([^\\s]*)"],
  "extract": "=([^\\s]*)",
  "include_files": ["java_process.txt", "pid.txt", "ambari-agent.log", "oozie-env.cmd"],
  "shared": false
}
Input data, oozie-env.cmd, is:
oozie.https.keystore.pass=abcde
set OOZIE_HTTPS_KEYSTORE_PASS=12345
To anonymize the content of the input file, the patterns configured in the rule are applied first. The pattern oozie.https.keystore.pass=([^\\s]*) matches oozie.https.keystore.pass=abcde, and the pattern OOZIE_HTTPS_KEYSTORE_PASS=([^\\s]*) matches OOZIE_HTTPS_KEYSTORE_PASS=12345.
Next, the extract pattern =([^\\s]*) is applied to each match to identify abcde and 12345, which are the values to be anonymized.
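The two-stage matching described here can be sketched in Python. This illustrates the idea under stated assumptions and is not the product's implementation: `values_to_anonymize` is a hypothetical helper, and it assumes the extract pattern is applied to the text of each pattern match and that its first capture group is the value to anonymize.

```python
import re

# Patterns and extract pattern from the KEYSTORE rule, unescaped from JSON.
PATTERNS = [r"oozie.https.keystore.pass=([^\s]*)", r"OOZIE_HTTPS_KEYSTORE_PASS=([^\s]*)"]
EXTRACT = r"=([^\s]*)"


def values_to_anonymize(text):
    """Return the substrings the extract pattern selects inside each pattern match."""
    values = []
    for pattern in PATTERNS:
        for match in re.finditer(pattern, text):
            # Stage 2: narrow the matched text down to the value after "=".
            extracted = re.search(EXTRACT, match.group(0))
            if extracted:
                values.append(extracted.group(1))
    return values


sample = "oozie.https.keystore.pass=abcde\nset OOZIE_HTTPS_KEYSTORE_PASS=12345\n"
print(values_to_anonymize(sample))  # ['abcde', '12345']
```

The property names are part of the match but not of the extracted value, so they stay readable in the output.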
The content of the output file oozie-env.cmd is:
oozie.https.keystore.pass=‡vvdwa‡
set OOZIE_HTTPS_KEYSTORE_PASS=‡zdowg‡
The values of oozie.https.keystore.pass and OOZIE_HTTPS_KEYSTORE_PASS have been anonymized.
More Examples
Example 1: Mask by pattern across all log files, without extract pattern
To mask all email addresses in all log files, use the following rule definition:
{
  "name": "EMAIL",
  "rule_id": "Pattern",
  "patterns": ["(?<![a-z0-9._%+-])[a-z0-9._%+-]+@[a-z0-9.-]+\\.[a-z]{2,6}(?![a-z0-9._%+-])"],
  "include_files": ["*.log*"],
  "shared": false
}
Example 2: Mask by pattern across all log files, with extract pattern
To mask encryption keys that are logged in the format Key=12.., where the value consists of 64 hexadecimal characters, use the following rule definition:
{
  "name": "ENC_KEYS",
  "rule_id": "Pattern",
  "patterns": ["Key=[a-f\\d]{64}\\s"],
  "extract": "=([a-f\\d]{64})",
  "include_files": ["*.log*"],
  "shared": false
}
Input data, test.log, is:
encryption key=1234567890adc1234567aaabc1234567890adc1234567aaabc12345678901234 for keystore
derby.system.home=null
Output data, test.log, with the encryption keys anonymized, is:
encryption key=‡8697685738fnx1736987qigyx7611731027yds0096404hlsph91727138403654‡ for keystore
derby.system.home=null
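The ENC_KEYS rule can be checked against the sample line with a short Python sketch. The rule writes Key= while the sample log line contains key=, and the text here does not say whether matching is case-sensitive. The sketch therefore assumes case-insensitive matching (`re.IGNORECASE`), which is an assumption and not documented behavior.

```python
import re

# Pattern and extract pattern from the ENC_KEYS rule, unescaped from JSON.
PATTERN = r"Key=[a-f\d]{64}\s"
EXTRACT = r"=([a-f\d]{64})"

line = (
    "encryption key=1234567890adc1234567aaabc1234567890adc1234567aaabc12345678901234"
    " for keystore"
)

# Assumption: matching ignores case, because the rule says "Key=" and the log says "key=".
match = re.search(PATTERN, line, re.IGNORECASE)
value = re.search(EXTRACT, match.group(0), re.IGNORECASE).group(1) if match else None

# For comparison: a case-sensitive search does not match this line at all.
case_sensitive_match = re.search(PATTERN, line)

print(value)
print(case_sensitive_match)
```

With that assumption, the extract pattern selects exactly the 64 hexadecimal characters after the equals sign. The second line of the sample, derby.system.home=null, does not match the pattern and is left as is.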
Example 3: Mask by pattern across all files, except a few files
To mask email addresses in all files except hdfs-site.xml and .properties files, use the following rule definition:
{
  "name": "EMAIL",
  "rule_id": "Pattern",
  "patterns": ["(?<![a-z0-9._%+-])[a-z0-9._%+-]+@[a-z0-9.-]+\\.[a-z]{2,6}(?![a-z0-9._%+-])"],
  "exclude_files": ["*.properties", "hdfs-site.xml"],
  "shared": false
}
Input data, version.txt, is:
Hadoop 2.7.3.2.5.0.0-1245
Subversion git@github.com:hortonworks/hadoop.git -r cb6e514b14fb60e9995e5ad9543315cd404b4e59
Compiled by jenkins on 2016-08-26T00:55Z
Output file version.txt, with an anonymized email address, is:
Hadoop 2.7.3.2.5.0.0-1245
Subversion ‡qpe@unqfay.mjp‡:hortonworks/hadoop.git -r cb6e514b14fb60e9995e5ad9543315cd404b4e59
Compiled by jenkins on 2016-08-26T00:55Z
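The include_files and exclude_files filters used in these examples can be reasoned about with a small Python sketch. It rests on assumptions that the examples suggest but do not confirm: that the entries are shell-style globs matched against the file name, that a rule without include_files applies to all files, and that exclude_files takes precedence. `rule_applies` is a hypothetical helper, not a product function.

```python
from fnmatch import fnmatchcase


def rule_applies(filename, include_files=None, exclude_files=None):
    """Decide whether a rule should run on a file, given optional glob lists."""
    # Assumption: with no include_files, every file is a candidate.
    if include_files and not any(fnmatchcase(filename, g) for g in include_files):
        return False
    # Assumption: an exclude_files match always wins.
    if exclude_files and any(fnmatchcase(filename, g) for g in exclude_files):
        return False
    return True


excludes = ["*.properties", "hdfs-site.xml"]
print(rule_applies("version.txt", exclude_files=excludes))       # True
print(rule_applies("hdfs-site.xml", exclude_files=excludes))     # False
print(rule_applies("ambari-agent.log.1", include_files=["*.log*"]))  # True
```

Under these assumptions version.txt is processed by the rule in Example 3, while hdfs-site.xml and any .properties file are skipped.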