How to Enable Sensitive Data Redaction
Redaction is a process that obscures data. It helps organizations comply with government and industry regulations, such as PCI (Payment Card Industry) and HIPAA, by making personally identifiable information (PII) unreadable except to those whose jobs require such access. For example, in simple terms, HIPAA legislation requires that patient PII is available only to appropriate medical professionals (and the patient), and that any medical or personal information exposed outside the appropriate context cannot be used to associate an individual's identity with any medical information. Data redaction can help ensure this privacy, by transforming PII to meaningless patterns—for example, transforming U.S. social security numbers to XXX-XX-XXXX strings.
Data redaction works separately from Cloudera data encryption techniques. Data encryption alone does not preclude administrators with full access to the cluster from viewing sensitive user data. Redaction ensures that cluster administrators, data analysts, and others cannot see PII or other sensitive data that is not within their job domain. At the same time, it does not prevent users with appropriate permissions from accessing data to which they have privileges.
Cloudera clusters implement some redaction features by default, while some features are configurable and require administrators to specifically enable them. The details are covered below:
Cloudera Manager and Passwords
- In the Cloudera Manager Admin Console:
- In the Processes page for a given role instance, passwords in the linked configuration files are replaced by *******.
- Advanced Configuration Snippet (Safety Valve) parameters, such as passwords and secret keys, are visible to users (such as admins) who have edit permissions on the parameter, while those with read-only access see redacted data. However, the parameter name is visible to anyone. (Data to be redacted from these snippets is identified by a fixed list of key words: password, key, aws, and secret.)
- On all Cloudera Manager Server and Cloudera Manager Agent hosts:
- Passwords in the configuration files in /var/run/cloudera-scm-agent/process are replaced by ********.
Cloudera Manager Server Database Password Handling
# Auto-generated by scm_prepare_database.sh on Mon Jan 30 05:02:18 PST 2017 # # For information describing how to configure the Cloudera Manager Server # to connect to databases, see the "Cloudera Manager Installation Guide." # com.cloudera.cmf.db.type=mysql com.cloudera.cmf.db.host=localhost com.cloudera.cmf.db.name=cm com.cloudera.cmf.db.user=cm com.cloudera.cmf.db.setupType=EXTERNAL com.cloudera.cmf.db.password=password
Instead of using a cleartext password, you can use a script or other executable that uses stdout to return a password for use by the system.
During installation of the database, you can pass the script name to the scm_prepare_database.sh script with the --scm-password-script parameter. See Step 5: Set up the Cloudera Manager Database and Syntax for scm_prepare_database.sh for details.
You can also replace an existing cleartext password in /etc/cloudera-scm-server/db.properties by replacing the com.cloudera.cmf.db.password setting with com.cloudera.cmf.db.password_script and setting the name of the script or executable:
Cleartext Password (5.9 and prior) | Script (5.10 and higher) |
---|---|
com.cloudera.cmf.db.password=password | com.cloudera.cmf.db.password_script=script_name_here |
At runtime, if /etc/cloudera-scm-server/db.properties does not include the script identified by com.cloudera.cmf.db.password_script, the system looks for the value of com.cloudera.cmf.db.password.
Cloudera Manager API Redaction
Cloudera Manager API does not have redaction enabled by default. You can configure redaction of the sensitive items by specifying a JVM parameter for Cloudera Manager. When you set this parameter, API calls to Cloudera Manager for configuration data do not include the sensitive information. For more information, see Redacting Sensitive Information from the Exported Configuration.
Log and Query Redaction
Cloudera Manager provides a configurable log and query redaction feature that lets you redact sensitive data in the CDH cluster as it's being written to the log files (see the Cloudera Engineering Blog "Sensitive Data Redaction" post for a technical overview), to prevent leakage of sensitive data. Redaction works only on data, not metadata—that is, sensitive data inside files is redacted, but the name, owner, and other metadata about the file is not.
- Logs in HDFS and any dependent cluster services.
- Audit data sent to Cloudera Navigator.
- SQL query strings displayed by Hue, Hive, and Impala.
See Enabling Log and Query Redaction Using Cloudera Manager (below) for information about how to enable and define rules for sensitive data redaction for your cluster's logs and SQL queries (Hive, Hue, Impala).
How Redaction Rules Work
Cloudera's redaction process (redactor) uses regular expressions to target data for redaction. Common regular expression patterns for sensitive data include social security numbers, credit card numbers, email addresses, and dates, for example. The redaction rules are specified using the following elements:
- Search - Regular expression to compare against the data. For example, the regular expression \d{4}[^\w]\d{4}[^\w]\d{4}[^\w]\d{4} searches for a credit card number pattern. Segments of data that match the regular expression are redacted using the Replace string.
- Replace - String used to redact (obfuscate) data, such as a pattern of Xs to replace digits of a credit card number: XXXX-XXXX-XXXX-XXXX.
- Trigger - Optional simple string to be searched before applying the regular expression. If the string is found, the redactor searches for matches using the Search regular expression. Using the Trigger field improves performance: simple string matching is faster than regular expression matching.
Rule | Regex Pattern | Replacement |
---|---|---|
Credit Card numbers (with separator) | \d{4}[^\w]\d{4}[^\w]\d{4}[^\w]\d{4} | XXXX-XXXX-XXXX-XXXX |
Social Security numbers (with separator) | \d{3}[^\w]\d{2}[^\w]\d{4} | XXX-XX-XXXX |
Email addresses | \b([A-Za-z0-9]|[A-Za-z0-9][A-Za-z0-9\-\._] \ *[A-Za-z0-9])@(([A-Za-z0-9]|[A-Za-z] \ [A-Za-z0-9\-]*[A-Za-z0-9])\.)+([A-Za-z0-9] \ |[A-Za-z0-9][A-Za-z0-9\-]*[A-Za-z0-9])\b | email@redacted.host |
Hostnames | \b(([A-Za-z]|[A-Za-z][A-Za-z0-9\-] \ *[A-Za-z0-9])\.)+([A-Za-z0-9] \ |[A-Za-z0-9][A-Za-z0-9\-]*[A-Za-z0-9])\b | HOSTNAME.REDACTED |
Enabling Log and Query Redaction Using Cloudera Manager
- Login to the Cloudera Manager Admin Console.
- Select .
- Click the Configuration tab.
- In the Search box, type redaction to find the redaction property settings:
- Enable Log and Query Redaction
- Log and Query Redaction Policy List of rules for redacting sensitive information from log files and query strings. Choose a preconfigured rule or add a custom rule. See How Redaction Rules Work for more information about rule pattern definitions.
Test your rules:- Enter sample text into the Test Redaction Rules text box
- Click Test Redaction.
- Click Save Changes.
- Restart the cluster.
If no rules match, the text you entered displays in the Results field, unchanged.
Using Cloudera Navigator Data Management for Data Redaction
You can specify credit card number patterns and other PII to be masked in audit events, in the properties of entities displayed in lineage diagrams, and in information retrieved from the Audit Server database and the Metadata Server persistent storage. Redacting data other than credit card numbers is not supported by default with the Cloudera Navigator property. You can use regular expressions to redact social security numbers or other PII. Masking applies only to audit events and lineage entities generated after enabling a mask.
Required Role: Navigator Administrator or Full Administrator
- Log into Cloudera Manager Admin Console.
- Select .
- Click the Configuration tab.
- Expand the Navigator Audit Server Default Group category.
- Click the Advanced category.
- Configure the PII Masking Regular Expression property with a regular expression that matches the credit card number
formats to be masked. The default expression is:
(4[0-9]{12}(?:[0-9]{3})?)|(5[1-5][0-9]{14})| (3[47][0-9]{13})|(3(?:0[0-5]|[68][0-9])[0-9]{11})| (6(?:011|5[0-9]{2})[0-9]{12})|((?:2131|1800|35\\d{3})\\d{11})
which consolidates these regular expressions:- Visa - (4[0-9]{12}(?:[0-9]{3})?)
- MasterCard - (5[1-5][0-9]{14})
- American Express - (3[47][0-9]{13})
- Diners Club - (3(?:0[0-5]|[68][0-9])[0-9]{11})
- Discover - (6(?:011|5[0-9]{2})[0-9]{12})
- JCB - ((?:2131|1800|35\\d{3})\\d{11})
- Click Save Changes.