How to Enable Sensitive Data Redaction
Redaction is a process that obscures data. It helps organizations comply with government and
industry regulations, such as PCI (Payment Card Industry) and HIPAA, by making personally
identifiable information (PII) unreadable except to those whose jobs require such access. For
example, in simple terms, HIPAA legislation requires that patient PII is available only to
appropriate medical professionals (and the patient), and that any medical or personal
information exposed outside the appropriate context cannot be used to associate an
individual's identity with any medical information. Data redaction can help ensure this
privacy by transforming PII into meaningless patterns, such as replacing U.S. social
security numbers with XXX-XX-XXXX strings.
Data redaction works separately from Cloudera data encryption techniques. Data encryption alone does not preclude administrators with full access to the cluster from viewing sensitive user data. Redaction ensures that cluster administrators, data analysts, and others cannot see PII or other sensitive data that is not within their job domain. At the same time, it does not prevent users with appropriate permissions from accessing data to which they have privileges.
Cloudera clusters implement some redaction features by default, while some features are configurable and require administrators to specifically enable them. The details are covered below:
Cloudera Manager and Passwords
- In the Cloudera Manager Admin Console:
  - In the Processes page for a given role instance, passwords in the linked configuration files are replaced by *******.
  - Advanced Configuration Snippet (Safety Valve) parameters, such as passwords and secret keys, are visible to users (such as admins) who have edit permissions on the parameter, while those with read-only access see redacted data. However, the parameter name is visible to anyone. (Data to be redacted from these snippets is identified by a fixed list of key words: password, key, aws, and secret.)
- On all Cloudera Manager Server and Cloudera Manager Agent hosts:
  - Passwords in the configuration files in /var/run/cloudera-scm-agent/process are replaced by ********.
Cloudera Manager Server Database Password Handling
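The password for the Cloudera Manager Server database is stored in cleartext in the configuration file /etc/cloudera-scm-server/db.properties, as shown in this example: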
# Auto-generated by scm_prepare_database.sh on Mon Jan 30 05:02:18 PST 2017
#
# For information describing how to configure the Cloudera Manager Server
# to connect to databases, see the "Cloudera Manager Installation Guide."
#
com.cloudera.cmf.db.type=mysql
com.cloudera.cmf.db.host=localhost
com.cloudera.cmf.db.name=cm
com.cloudera.cmf.db.user=cm
com.cloudera.cmf.db.setupType=EXTERNAL
com.cloudera.cmf.db.password=password
Instead of using a cleartext password, you can use a script or other executable that uses stdout to return a password for use by the system. During installation of the database, you can pass the script name to the scm_prepare_database.sh script with the --scm-password-script parameter. See Step 5: Set up and Configure the Cloudera Manager Database and Syntax for scm_prepare_database.sh for details.
You can also replace an existing cleartext password in /etc/cloudera-scm-server/db.properties by replacing the com.cloudera.cmf.db.password setting with com.cloudera.cmf.db.password_script and setting the name of the script or executable:
Cleartext Password (5.9 and prior) | Script (5.10 and higher) |
---|---|
com.cloudera.cmf.db.password=password | com.cloudera.cmf.db.password_script=script_name_here |
At runtime, if /etc/cloudera-scm-server/db.properties does not include the script identified by com.cloudera.cmf.db.password_script, the system looks for the value of com.cloudera.cmf.db.password.
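As a rough sketch of what such an executable might contain, the Python code below reads a password from a file and writes it to stdout. The file path, the function names, and the choice to emit the password without a trailing newline are assumptions made for illustration; they are not defined by Cloudera Manager.

```python
import sys
from pathlib import Path

# Hypothetical location of the secret; Cloudera Manager does not define this path.
PASSWORD_FILE = Path("/etc/cloudera-scm-server/db_password.txt")


def read_password(path: Path = PASSWORD_FILE) -> str:
    """Return the password stored in `path`, without the trailing newline."""
    return path.read_text(encoding="utf-8").rstrip("\r\n")


def main(path: Path = PASSWORD_FILE) -> int:
    """Write the password to stdout and return a shell-style exit status."""
    try:
        sys.stdout.write(read_password(path))
    except OSError as err:
        # Report problems on stderr so that stdout carries only the password.
        print(f"cannot read {path}: {err}", file=sys.stderr)
        return 1
    return 0
```

To use this as the script named in com.cloudera.cmf.db.password_script, the file would also need a `#!/usr/bin/env python3` first line, a final `sys.exit(main())` call, and executable permissions. Restricting read access on both the script and the password file keeps the secret away from other local users.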
Cloudera Manager API Redaction
Cloudera Manager API has redaction enabled by default. If you use the API to export the configuration, the output may contain passwords and other sensitive information. The Cloudera Manager API automatically redacts the sensitive items returned from API calls.
You can disable redaction of the sensitive items by specifying a JVM parameter for Cloudera Manager. For more information, see Disabling Redaction of sensitive information when using the Cloudera Manager API.
Log and Query Redaction
Cloudera Manager provides a configurable log and query redaction feature that lets you redact sensitive data in the CDP cluster as it is written to the log files, to prevent leakage of sensitive data (see the Cloudera Engineering Blog post Sensitive Data Redaction for a technical overview). Redaction works only on data, not metadata. That is, sensitive data inside files is redacted, but the name, owner, and other metadata about the file are not.
With redaction enabled, the following contain replacement strings (such as a series of Xs) for the sensitive data items you define in your rules:
- Logs in HDFS and any dependent cluster services.
- Audit data sent to Cloudera Navigator.
- SQL query strings displayed by Hue, Hive, and Impala.
See Enabling Log and Query Redaction Using Cloudera Manager (below) for information about how to enable and define rules for sensitive data redaction for your cluster's logs and SQL queries (Hive, Hue, Impala).
How Redaction Rules Work
Cloudera's redaction process (redactor) uses regular expressions to target data for redaction. Common regular expression patterns for sensitive data include social security numbers, credit card numbers, email addresses, and dates, for example. The redaction rules are specified using the following elements:
- Search - Regular expression to compare against the data. For example, the regular expression \d{4}[^\w]\d{4}[^\w]\d{4}[^\w]\d{4} searches for a credit card number pattern. Segments of data that match the regular expression are redacted using the Replace string.
- Replace - String used to redact (obfuscate) data, such as a pattern of Xs to replace digits of a credit card number: XXXX-XXXX-XXXX-XXXX.
- Trigger - Optional simple string to be searched before applying the regular expression. If the string is found, the redactor searches for matches using the Search regular expression. Using the Trigger field improves performance: simple string matching is faster than regular expression matching.
Rule | Regex Pattern | Replacement |
---|---|---|
Credit Card numbers (with separator) | \d{4}[^\w]\d{4}[^\w]\d{4}[^\w]\d{4} | XXXX-XXXX-XXXX-XXXX |
Social Security numbers (with separator) | \d{3}[^\w]\d{2}[^\w]\d{4} | XXX-XX-XXXX |
Email addresses | \b([A-Za-z0-9]|[A-Za-z0-9][A-Za-z0-9\-\._]*[A-Za-z0-9])@(([A-Za-z0-9]|[A-Za-z][A-Za-z0-9\-]*[A-Za-z0-9])\.)+([A-Za-z0-9]|[A-Za-z0-9][A-Za-z0-9\-]*[A-Za-z0-9])\b | email@redacted.host |
Hostnames | \b(([A-Za-z]|[A-Za-z][A-Za-z0-9\-]*[A-Za-z0-9])\.)+([A-Za-z0-9]|[A-Za-z0-9][A-Za-z0-9\-]*[A-Za-z0-9])\b | HOSTNAME.REDACTED |
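To make the interaction of Search, Replace, and Trigger concrete, the following Python sketch applies three of the rules from the table to a string. It only approximates the behavior described above and is not Cloudera's redactor: the `RedactionRule` class, the `redact` function, and the choice of `@` as a trigger for the email rule are assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class RedactionRule:
    search: str                    # regular expression to match
    replace: str                   # string substituted for each match
    trigger: Optional[str] = None  # optional plain string checked first


# Search and Replace values are taken from the table of rules; the "@"
# trigger on the email rule is an illustrative assumption.
RULES = (
    RedactionRule(r"\d{4}[^\w]\d{4}[^\w]\d{4}[^\w]\d{4}", "XXXX-XXXX-XXXX-XXXX"),
    RedactionRule(r"\d{3}[^\w]\d{2}[^\w]\d{4}", "XXX-XX-XXXX"),
    RedactionRule(
        r"\b([A-Za-z0-9]|[A-Za-z0-9][A-Za-z0-9\-\._]*[A-Za-z0-9])"
        r"@(([A-Za-z0-9]|[A-Za-z][A-Za-z0-9\-]*[A-Za-z0-9])\.)+"
        r"([A-Za-z0-9]|[A-Za-z0-9][A-Za-z0-9\-]*[A-Za-z0-9])\b",
        "email@redacted.host",
        trigger="@",
    ),
)


def redact(text: str, rules: Sequence[RedactionRule] = RULES) -> str:
    """Apply each rule in order and return the redacted text."""
    for rule in rules:
        # Cheap substring test: skip the regex entirely if the trigger is absent.
        if rule.trigger is not None and rule.trigger not in text:
            continue
        text = re.sub(rule.search, rule.replace, text)
    return text


print(redact("card 1234-5678-9012-3456, ssn 123-45-6789"))
print(redact("contact alice@example.com now"))
```

Text that matches no rule is returned unchanged. The email regex is never evaluated for input that contains no `@`, which is the saving the Trigger field is meant to provide.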
Ways to optimize regular expressions
- Simplify the pattern as much as possible.
- Use non-capturing groups if you do not need to extract data.
- Avoid unnecessary backtracking.
- Consider the specificities of the regex engine you are using.
- Test the performance with realistic data sets.
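As one illustration of these tips, the sketch below rewrites the hostname pattern with non-capturing groups and gives a small helper for timing a pattern against sample text. It uses Python's `re` module, which may behave differently from the regex engine used by the redactor, so treat any timings as indicative only. The names `HOSTNAME_NON_CAPTURING` and `time_pattern` are hypothetical.

```python
import re
import timeit

# Hostname pattern from the rules table, with capturing groups.
HOSTNAME_CAPTURING = re.compile(
    r"\b(([A-Za-z]|[A-Za-z][A-Za-z0-9\-]*[A-Za-z0-9])\.)+"
    r"([A-Za-z0-9]|[A-Za-z0-9][A-Za-z0-9\-]*[A-Za-z0-9])\b"
)

# Same pattern with (?:...) groups: nothing is extracted, so nothing is stored.
HOSTNAME_NON_CAPTURING = re.compile(
    r"\b(?:(?:[A-Za-z]|[A-Za-z][A-Za-z0-9\-]*[A-Za-z0-9])\.)+"
    r"(?:[A-Za-z0-9]|[A-Za-z0-9][A-Za-z0-9\-]*[A-Za-z0-9])\b"
)


def time_pattern(pattern: re.Pattern, sample: str, number: int = 1000) -> float:
    """Return the seconds taken to redact `sample` `number` times."""
    return timeit.timeit(
        lambda: pattern.sub("HOSTNAME.REDACTED", sample), number=number
    )


sample = "connect to node1.example.com"
# Both forms redact the same text; only the bookkeeping differs.
assert HOSTNAME_CAPTURING.sub("HOSTNAME.REDACTED", sample) == \
    HOSTNAME_NON_CAPTURING.sub("HOSTNAME.REDACTED", sample)
```

Calling `time_pattern` on both patterns with lines taken from real logs, rather than a short sample like the one here, gives a more realistic comparison.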
Enabling Log and Query Redaction Using Cloudera Manager
- Log in to the Cloudera Manager Admin Console.
- Select .
- Click the Configuration tab.
- In the Search box, type redaction to find the redaction property settings:
  - Enable Log and Query Redaction
  - Log and Query Redaction Policy: list of rules for redacting sensitive information from log files and query strings. Choose a preconfigured rule or add a custom rule. See How Redaction Rules Work for more information about rule pattern definitions.
- Test your rules:
  - Enter sample text into the Test Redaction Rules text box.
  - Click Test Redaction.
  If no rules match, the text you entered displays in the Results field, unchanged.
- Click Save Changes.
- Restart the cluster.
Using Cloudera Navigator Data Management for Data Redaction
You can specify credit card number patterns and other PII to be masked in audit events, in the properties of entities displayed in lineage diagrams, and in information retrieved from the Audit Server database and the Metadata Server persistent storage. Redacting data other than credit card numbers is not supported by default with the Cloudera Navigator property. You can use regular expressions to redact social security numbers or other PII. Masking applies only to audit events and lineage entities generated after enabling a mask.
Minimum Required Role: Full Administrator. This feature is not available when using Cloudera Manager to manage Data Hub clusters.
- Log in to the Cloudera Manager Admin Console.
- Select .
- Click the Configuration tab.
- Expand the Navigator Audit Server Default Group category.
- Click the Advanced category.
- Configure the PII Masking Regular Expression property with a regular expression that matches the credit card number formats to be masked. The default expression is:
  (4[0-9]{12}(?:[0-9]{3})?)|(5[1-5][0-9]{14})|(3[47][0-9]{13})|(3(?:0[0-5]|[68][0-9])[0-9]{11})|(6(?:011|5[0-9]{2})[0-9]{12})|((?:2131|1800|35\\d{3})\\d{11})
  which consolidates these regular expressions:
  - Visa - (4[0-9]{12}(?:[0-9]{3})?)
  - MasterCard - (5[1-5][0-9]{14})
  - American Express - (3[47][0-9]{13})
  - Diners Club - (3(?:0[0-5]|[68][0-9])[0-9]{11})
  - Discover - (6(?:011|5[0-9]{2})[0-9]{12})
  - JCB - ((?:2131|1800|35\\d{3})\\d{11})
- Click Save Changes.
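Before saving a custom expression, it can help to check offline which values it would match. The Python sketch below does this for the default expression, with the doubled backslashes (`\\d`) written as single regex escapes (`\d`) for a Python raw string. The `find_matches` helper and the extended pattern that adds a dash-separated social security number are hypothetical examples. Python's `re` module may also differ in detail from the engine Navigator uses.

```python
import re

# Default credit card expression from the PII Masking Regular Expression step.
DEFAULT_PII = (
    r"(4[0-9]{12}(?:[0-9]{3})?)|(5[1-5][0-9]{14})|(3[47][0-9]{13})"
    r"|(3(?:0[0-5]|[68][0-9])[0-9]{11})|(6(?:011|5[0-9]{2})[0-9]{12})"
    r"|((?:2131|1800|35\d{3})\d{11})"
)

# Hypothetical extension: also match dash-separated social security numbers.
EXTENDED_PII = DEFAULT_PII + r"|(\d{3}-\d{2}-\d{4})"


def find_matches(expression: str, text: str) -> list:
    """Return every substring of `text` that `expression` would select."""
    return [m.group(0) for m in re.finditer(expression, text)]


sample = "visa 4111111111111111, amex 378282246310005, ssn 123-45-6789"
print(find_matches(DEFAULT_PII, sample))
print(find_matches(EXTENDED_PII, sample))
```

With the default expression only the two card numbers in the sample are selected. The extended expression also selects the social security number.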