How to Enable Sensitive Data Redaction

Redaction is a process that obscures data. It helps organizations comply with government and industry regulations, such as PCI (Payment Card Industry) and HIPAA, by making personally identifiable information (PII) unreadable except to those whose jobs require such access. For example, in simple terms, HIPAA legislation requires that patient PII is available only to appropriate medical professionals (and the patient), and that any medical or personal information exposed outside the appropriate context cannot be used to associate an individual's identity with any medical information. Data redaction can help ensure this privacy, by transforming PII to meaningless patterns—for example, transforming U.S. social security numbers to XXX-XX-XXXX strings.

Data redaction works separately from Cloudera data encryption techniques. Data encryption alone does not preclude administrators with full access to the cluster from viewing sensitive user data. Redaction ensures that cluster administrators, data analysts, and others cannot see PII or other sensitive data that is not within their job domain. At the same time, it does not prevent users with appropriate permissions from accessing data to which they have privileges.

Cloudera clusters implement some redaction features by default, while some features are configurable and require administrators to specifically enable them. The details are covered below:

Cloudera Manager and Passwords

Passwords are not in cleartext in the Cloudera Manager Admin Console or the configuration files on disk. Passwords managed by Cloudera Manager and Cloudera Navigator are redacted internally, with the following results:
  • In the Cloudera Manager Admin Console:
    • In the Processes page for a given role instance, passwords in the linked configuration files are replaced by *******.
    • Advanced Configuration Snippet (Safety Valve) parameters, such as passwords and secret keys, are visible to users (such as admins) who have edit permissions on the parameter, while those with read-only access see redacted data. However, the parameter name is visible to anyone. (Data to be redacted from these snippets is identified by a fixed list of keywords: password, key, aws, and secret.)
  • On all Cloudera Manager Server and Cloudera Manager Agent hosts:
    • Passwords in the configuration files in /var/run/cloudera-scm-agent/process are replaced by ********.
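
The keyword-based redaction of Advanced Configuration Snippet values described above can be pictured with a small Python sketch. This is an illustration only, not Cloudera Manager's implementation: the function name, the mask string, and the case-insensitive substring match on the parameter name are all assumptions.

```python
# Hypothetical illustration; not Cloudera Manager's actual code.
SENSITIVE_KEYWORDS = ("password", "key", "aws", "secret")  # list given in the text
MASK = "********"  # assumed mask string


def redact_snippet(params, can_edit):
    """Return a copy of params with sensitive values masked for read-only users."""
    if can_edit:
        # Users with edit permission see the real values.
        return dict(params)
    redacted = {}
    for name, value in params.items():
        # Assumption: case-insensitive substring match on the parameter name.
        is_sensitive = any(word in name.lower() for word in SENSITIVE_KEYWORDS)
        # The parameter name stays visible; only the value is masked.
        redacted[name] = MASK if is_sensitive else value
    return redacted
```

With this sketch, a read-only user would see the name fs.s3a.secret.key with a masked value, while an unrelated setting such as dfs.replication is shown unchanged.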

Cloudera Manager Server Database Password Handling

Unlike the other passwords that are redacted or encrypted by Cloudera Manager, the password used for the Cloudera Manager Server database is stored in plaintext in the configuration file, /etc/cloudera-scm-server/db.properties, as shown in this example:

# Auto-generated by scm_prepare_database.sh on Mon Jan 30 05:02:18 PST 2017
#
# For information describing how to configure the Cloudera Manager Server
# to connect to databases, see the "Cloudera Manager Installation Guide."
#
com.cloudera.cmf.db.type=mysql
com.cloudera.cmf.db.host=localhost
com.cloudera.cmf.db.name=cm
com.cloudera.cmf.db.user=cm
com.cloudera.cmf.db.setupType=EXTERNAL
com.cloudera.cmf.db.password=password

Instead of using a cleartext password, you can use a script or other executable that uses stdout to return a password for use by the system.

During installation of the database, you can pass the script name to the scm_prepare_database.sh script with the --scm-password-script parameter. See Step 5: Set up and Configure the Cloudera Manager Database and Syntax for scm_prepare_database.sh for details.

You can also replace an existing cleartext password in /etc/cloudera-scm-server/db.properties by replacing the com.cloudera.cmf.db.password setting with com.cloudera.cmf.db.password_script and setting the name of the script or executable:

  • Cleartext Password (5.9 and prior): com.cloudera.cmf.db.password=password
  • Script (5.10 and higher): com.cloudera.cmf.db.password_script=script_name_here

At runtime, if /etc/cloudera-scm-server/db.properties does not include a com.cloudera.cmf.db.password_script setting, the system looks for the value of com.cloudera.cmf.db.password instead.
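
As a rough idea of what such a script might contain, here is a minimal Python sketch that reads a password from a file and writes it to stdout. The function names and the idea of keeping the password in a separate, tightly permissioned file are assumptions for illustration. Whether the system expects a trailing newline is not stated here, so the sketch emits none.

```python
import sys
from pathlib import Path


def read_password(path):
    """Return the first line of the file, without its trailing newline."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return lines[0] if lines else ""


def emit_password(path, out=sys.stdout):
    """Write the password to the given stream (stdout by default)."""
    # Assumption: the password is written without a trailing newline.
    out.write(read_password(path))


# A real script would end with a call such as the following,
# where the path is a hypothetical file readable only by the service account:
#   emit_password("/path/to/password/file")
```

The script itself must be executable by the account that runs Cloudera Manager Server, and the file it reads should not be readable by other users.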

Cloudera Manager API Redaction

The Cloudera Manager API has redaction enabled by default. A configuration exported through the API can contain passwords and other sensitive information, so the API automatically redacts sensitive items in the responses it returns.

You can disable redaction of the sensitive items by specifying a JVM parameter for Cloudera Manager. For more information, see Disabling Redaction of sensitive information when using the Cloudera Manager API.

Log and Query Redaction

Cloudera Manager provides a configurable log and query redaction feature that lets you redact sensitive data in the CDP cluster as it is being written to the log files, to prevent leakage of sensitive data (see the Cloudera Engineering Blog Sensitive Data Redaction post for a technical overview). Redaction works only on data, not metadata: sensitive data inside files is redacted, but the name, owner, and other metadata about the file are not.

Redaction is enabled for the entire cluster through the Cloudera Manager Admin Console, which also lets you define rules to target sensitive data in SQL data and log files. After enabling data redaction, the following contain replacement strings (such as a series of Xs) for the sensitive data items you define in your rules:
  • Logs in HDFS and any dependent cluster services.
  • Audit data sent to Cloudera Navigator.
  • SQL query strings displayed by Hue, Hive, and Impala.

See Enabling Log and Query Redaction Using Cloudera Manager (below) for information about how to enable and define rules for sensitive data redaction for your cluster's logs and SQL queries (Hive, Hue, Impala).

How Redaction Rules Work

Cloudera's redaction process (redactor) uses regular expressions to target data for redaction. Common targets for these patterns include social security numbers, credit card numbers, email addresses, and dates. The redaction rules are specified using the following elements:

  • Search - Regular expression to compare against the data. For example, the regular expression \d{4}[^\w]\d{4}[^\w]\d{4}[^\w]\d{4} searches for a credit card number pattern. Segments of data that match the regular expression are redacted using the Replace string.
  • Replace - String used to redact (obfuscate) data, such as a pattern of Xs to replace digits of a credit card number: XXXX-XXXX-XXXX-XXXX.
  • Trigger - Optional simple string to be searched before applying the regular expression. If the string is found, the redactor searches for matches using the Search regular expression. Using the Trigger field improves performance: simple string matching is faster than regular expression matching.
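
The interplay of the three elements can be sketched in Python. This is a simplified model rather than the redactor's own code: the function name apply_rule and the sample trigger string card= are hypothetical, and Python's re module may differ in details from the regex engine the redactor uses.

```python
import re


def apply_rule(text, search, replace, trigger=None):
    """Redact text with one rule; skip the regex when the trigger is absent."""
    # Cheap substring test first: run the regex only if the trigger occurs.
    if trigger is not None and trigger not in text:
        return text
    # A function replacement keeps the Replace string literal
    # (no backslash or group-reference processing).
    return re.sub(search, lambda _match: replace, text)


CARD = r"\d{4}[^\w]\d{4}[^\w]\d{4}[^\w]\d{4}"
line = "charge card=1234-5678-9012-3456 ok"
print(apply_rule(line, CARD, "XXXX-XXXX-XXXX-XXXX", trigger="card="))
# prints: charge card=XXXX-XXXX-XXXX-XXXX ok
```

A line that lacks the trigger string is returned as is without the regular expression ever being evaluated, which is where the performance benefit comes from.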

You can use the following preconfigured redaction rules on your cluster. Rules are applied in the order listed.

  • Credit Card numbers (with separator)
    • Regex Pattern: \d{4}[^\w]\d{4}[^\w]\d{4}[^\w]\d{4}
    • Replacement: XXXX-XXXX-XXXX-XXXX
  • Social Security numbers (with separator)
    • Regex Pattern: \d{3}[^\w]\d{2}[^\w]\d{4}
    • Replacement: XXX-XX-XXXX
  • Email addresses
    • Regex Pattern: \b([A-Za-z0-9]|[A-Za-z0-9][A-Za-z0-9\-\._]*[A-Za-z0-9])@(([A-Za-z0-9]|[A-Za-z][A-Za-z0-9\-]*[A-Za-z0-9])\.)+([A-Za-z0-9]|[A-Za-z0-9][A-Za-z0-9\-]*[A-Za-z0-9])\b
    • Replacement: email@redacted.host
  • Hostnames
    • Regex Pattern: \b(([A-Za-z]|[A-Za-z][A-Za-z0-9\-]*[A-Za-z0-9])\.)+([A-Za-z0-9]|[A-Za-z0-9][A-Za-z0-9\-]*[A-Za-z0-9])\b
    • Replacement: HOSTNAME.REDACTED
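
To try the preconfigured rules against sample text outside the cluster, something like the following Python sketch could be used. It writes each pattern on one line and applies the rules in the listed order. The names RULES and redact are hypothetical, and Python's re module stands in for the redactor's regex engine, so edge-case behavior may differ.

```python
import re

# (rule name, search pattern, replacement), in the order given in the text.
RULES = [
    ("Credit Card numbers", r"\d{4}[^\w]\d{4}[^\w]\d{4}[^\w]\d{4}",
     "XXXX-XXXX-XXXX-XXXX"),
    ("Social Security numbers", r"\d{3}[^\w]\d{2}[^\w]\d{4}", "XXX-XX-XXXX"),
    ("Email addresses",
     r"\b([A-Za-z0-9]|[A-Za-z0-9][A-Za-z0-9\-\._]*[A-Za-z0-9])"
     r"@(([A-Za-z0-9]|[A-Za-z][A-Za-z0-9\-]*[A-Za-z0-9])\.)+"
     r"([A-Za-z0-9]|[A-Za-z0-9][A-Za-z0-9\-]*[A-Za-z0-9])\b",
     "email@redacted.host"),
    ("Hostnames",
     r"\b(([A-Za-z]|[A-Za-z][A-Za-z0-9\-]*[A-Za-z0-9])\.)+"
     r"([A-Za-z0-9]|[A-Za-z0-9][A-Za-z0-9\-]*[A-Za-z0-9])\b",
     "HOSTNAME.REDACTED"),
]
# Compile once so repeated calls do not re-parse the patterns.
COMPILED = [(re.compile(pattern), repl) for _name, pattern, repl in RULES]


def redact(text):
    """Apply every rule in order; each rule sees the previous rule's output."""
    for pattern, repl in COMPILED:
        text = pattern.sub(lambda _match, r=repl: r, text)
    return text


for sample in ("card 1234-5678-9012-3456",
               "ssn 123-45-6789",
               "connect to db01.example.com"):
    print(redact(sample))
```

Because each rule runs on the output of the previous one, in this sketch the Hostnames rule also rewrites the redacted.host portion of an email replacement, which is worth keeping in mind when ordering custom rules.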

Ways to optimize regular expressions

Regular expressions are used to redact logs and queries. Regular expressions are powerful tools for pattern matching and string manipulation, but their performance impact can vary significantly depending on how they are used. Cloudera recommends that you use the log and query redaction feature and construct regex patterns thoughtfully. Be aware of the implications of different regex constructs on performance. Following are a few ways to optimize regex performance:
  • Simplify the pattern as much as possible.
  • Use non-capturing groups if you do not need to extract data.
  • Avoid unnecessary backtracking.
  • Consider the specifics of the regex engine you are using.
  • Test the performance with realistic data sets.
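
Two of these suggestions, non-capturing groups and testing with realistic data, can be tried with a short Python sketch. It rewrites the hostname pattern with (?:...) groups and times both variants on sample log text. The sample line and helper names are made up for illustration, and timings from Python's re module will not necessarily carry over to the engine the redactor uses.

```python
import re
import timeit

CAPTURING = re.compile(
    r"\b(([A-Za-z]|[A-Za-z][A-Za-z0-9\-]*[A-Za-z0-9])\.)+"
    r"([A-Za-z0-9]|[A-Za-z0-9][A-Za-z0-9\-]*[A-Za-z0-9])\b")
# Same pattern with (?:...) groups: the replacement never uses captured text.
NON_CAPTURING = re.compile(
    r"\b(?:(?:[A-Za-z]|[A-Za-z][A-Za-z0-9\-]*[A-Za-z0-9])\.)+"
    r"(?:[A-Za-z0-9]|[A-Za-z0-9][A-Za-z0-9\-]*[A-Za-z0-9])\b")

# Hypothetical log text; replace with a realistic sample from your own logs.
SAMPLE = "2017-01-30 05:02:18 INFO connected to db01.example.com:3306\n" * 200


def redact(pattern, text):
    """Replace every match of pattern with the hostname placeholder."""
    return pattern.sub("HOSTNAME.REDACTED", text)


def time_pattern(pattern, text, repeats=5):
    """Best-of-N wall time in seconds for one redaction pass over text."""
    return min(timeit.repeat(lambda: redact(pattern, text),
                             number=1, repeat=repeats))


for name, pattern in (("capturing", CAPTURING), ("non-capturing", NON_CAPTURING)):
    print(name, time_pattern(pattern, SAMPLE))
```

Both variants produce identical output, so the comparison isolates the cost of the group type. Measure on your own data before concluding that a rewrite is worthwhile.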

Enabling Log and Query Redaction Using Cloudera Manager

To enable log and query redaction in Cloudera Manager:
  1. Log in to the Cloudera Manager Admin Console.
  2. Select Clusters > CORE_SETTINGS.
  3. Click the Configuration tab.
  4. In the Search box, type redaction to find the redaction property settings:
    • Enable Log and Query Redaction
    • Log and Query Redaction Policy: the list of rules for redacting sensitive information from log files and query strings. Choose a preconfigured rule or add a custom rule. See How Redaction Rules Work for more information about rule pattern definitions.

      Test your rules:
      • Enter sample text into the Test Redaction Rules text box.
      • Click Test Redaction.

      If no rules match, the text you entered displays in the Results field, unchanged.
  5. Click Save Changes.
  6. Restart the cluster.

Using Cloudera Navigator Data Management for Data Redaction

You can specify credit card number patterns and other PII to be masked in audit events, in the properties of entities displayed in lineage diagrams, and in information retrieved from the Audit Server database and the Metadata Server persistent storage. By default, the Cloudera Navigator property masks only credit card numbers. To redact social security numbers or other PII, supply your own regular expressions. Masking applies only to audit events and lineage entities generated after enabling a mask.

Minimum Required Role: Full Administrator. This feature is not available when using Cloudera Manager to manage Data Hub clusters.

  1. Log in to the Cloudera Manager Admin Console.
  2. Select Clusters > Cloudera Management Service.
  3. Click the Configuration tab.
  4. Expand the Navigator Audit Server Default Group category.
  5. Click the Advanced category.
  6. Configure the PII Masking Regular Expression property with a regular expression that matches the credit card number formats to be masked. The default expression is:
    (4[0-9]{12}(?:[0-9]{3})?)|(5[1-5][0-9]{14})|
    (3[47][0-9]{13})|(3(?:0[0-5]|[68][0-9])[0-9]{11})|
    (6(?:011|5[0-9]{2})[0-9]{12})|((?:2131|1800|35\\d{3})\\d{11})
    
    which consolidates these regular expressions:
    • Visa - (4[0-9]{12}(?:[0-9]{3})?)
    • MasterCard - (5[1-5][0-9]{14})
    • American Express - (3[47][0-9]{13})
    • Diners Club - (3(?:0[0-5]|[68][0-9])[0-9]{11})
    • Discover - (6(?:011|5[0-9]{2})[0-9]{12})
    • JCB - ((?:2131|1800|35\\d{3})\\d{11})
    If the property is left blank, PII is not masked.
  7. Click Save Changes.
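
Before saving a custom expression, it can help to check it against sample values. The Python sketch below joins the per-brand patterns listed above into one alternation and applies it to test strings. It assumes the doubled backslashes in the property value are string escaping, so the raw strings use a single backslash. The replacement text XXXX and the names CARD_PATTERNS, PII_MASK, and mask are hypothetical, and Navigator's actual mask output may look different.

```python
import re

# Per-brand patterns from the list above, written as Python raw strings.
CARD_PATTERNS = {
    "Visa": r"(4[0-9]{12}(?:[0-9]{3})?)",
    "MasterCard": r"(5[1-5][0-9]{14})",
    "American Express": r"(3[47][0-9]{13})",
    "Diners Club": r"(3(?:0[0-5]|[68][0-9])[0-9]{11})",
    "Discover": r"(6(?:011|5[0-9]{2})[0-9]{12})",
    "JCB": r"((?:2131|1800|35\d{3})\d{11})",
}
# Consolidate the brand patterns into a single alternation.
PII_MASK = re.compile("|".join(CARD_PATTERNS.values()))


def mask(text, replacement="XXXX"):
    """Replace every card-number match in text.

    The replacement string is a placeholder chosen for this sketch.
    """
    return PII_MASK.sub(replacement, text)


print(mask("card 4111111111111111 declined"))
# prints: card XXXX declined
```

Note that the patterns carry no word boundaries, so a longer run of digits that happens to contain a card-like prefix would also be partially masked.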