Sensitive Data Redaction

Data redaction is the suppression of sensitive data, such as any personally identifiable information (PII). PII can be used on its own or with other information to identify or locate a single person, or to identify an individual in context. Enabling redaction allows you to transform PII into a pattern that does not contain any identifiable information. For example, you could replace all Social Security numbers (SSN) like 123-45-6789 with an unintelligible pattern like XXX-XX-XXXX, or replace only part of the SSN (XXX-XX-6789).
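
The two SSN replacements described above can be sketched with a regular expression and a capture group. This is a minimal illustration in Python, not the redactor that CDH uses; the pattern and the function names redact_full and redact_partial are assumptions made for the example.

```python
import re

# Hypothetical pattern for SSNs written with hyphens; the last four
# digits are captured so that a partial redaction can keep them.
SSN = re.compile(r"\b\d{3}-\d{2}-(\d{4})\b")

def redact_full(text: str) -> str:
    """Replace every SSN with a fixed, unintelligible pattern."""
    return SSN.sub("XXX-XX-XXXX", text)

def redact_partial(text: str) -> str:
    """Replace only the first five digits and keep the last four."""
    return SSN.sub(r"XXX-XX-\1", text)

print(redact_full("SSN 123-45-6789"))
print(redact_partial("SSN 123-45-6789"))
```

Text that contains no SSN-like pattern passes through both functions unchanged.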

Although encryption techniques are available to protect Hadoop data, the underlying problem with using encryption is that an admin who has complete access to the cluster also has access to unencrypted sensitive user data. Even users with appropriate ACLs on the data could have access to logs and queries where sensitive data might have leaked.

Data redaction provides compliance with industry regulations such as PCI and HIPAA, which require that access to PII be restricted to only those users whose jobs require such access. PII or other sensitive data must not be available through any other channels to users like cluster administrators or data analysts. However, if you already have permissions to access PII through queries, the query results will not be redacted. Redaction applies only to incidental leaks of data: queries and query results must not show up in cleartext in logs, configuration files, UIs, or other unprotected areas.


Data redaction in CDH targets sensitive SQL data and log files. Currently, you can enable or disable redaction for the whole cluster with a simple HDFS service-wide configuration change. Redaction is implemented with the assumption that sensitive information resides in the data itself, not the metadata. If you enable redaction for a file, only sensitive data inside the file is redacted. Metadata such as the name of the file or file owner is not redacted.

When data redaction is enabled, the following data is redacted:
  • Logs in HDFS and any dependent cluster services. Log redaction is not available in Isilon-based clusters.
  • Audit data sent to Cloudera Navigator.
  • SQL query strings displayed by Hue, Hive, and Impala.

Redaction Rules

Redaction is based on pattern matching. Use regular expressions to define redaction rules that search for patterns of sensitive information such as Social Security numbers, credit card numbers, and dates.

Use Cloudera Manager to create redaction rules that have the following components:

  • Search - A regular expression matched against the data. If the expression matches any part of the data, the match is replaced by the contents of the replace string. For example, to redact credit card numbers, your regular expression is \d{4}[^\w]\d{4}[^\w]\d{4}[^\w]\d{4}.
  • Replace - The string used to replace the redacted data. For example, to replace any matched credit card digits with Xs, the Replace string value would be XXXX-XXXX-XXXX-XXXX.
  • Trigger - An optional field that specifies a simple string to be searched for in the data. The redactor searches for matches to the Search regular expression only if the trigger string is found. If no trigger is specified, redaction occurs whenever the Search regular expression is matched. Using the Trigger field improves performance: simple string matching is faster than regular expression matching.
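
How the three components interact can be sketched as follows. This is an illustrative Python model, not Cloudera's implementation: the RedactionRule class is hypothetical, the trigger "card" is invented for the example, and the trigger is assumed to be a case-sensitive substring check.

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class RedactionRule:
    """Hypothetical model of a rule with Search, Replace, and Trigger."""
    search: str                    # regular expression matched against the data
    replace: str                   # string substituted for each match
    trigger: Optional[str] = None  # optional plain string checked first

    def apply(self, text: str) -> str:
        # Cheap substring test first: if a trigger is set and absent,
        # the regular expression is never evaluated.
        if self.trigger is not None and self.trigger not in text:
            return text
        return re.sub(self.search, self.replace, text)

# Search and Replace values from the credit card example above;
# the trigger string is an assumption for illustration.
rule = RedactionRule(
    search=r"\d{4}[^\w]\d{4}[^\w]\d{4}[^\w]\d{4}",
    replace="XXXX-XXXX-XXXX-XXXX",
    trigger="card",
)

print(rule.apply("card=1234-5678-9012-3456"))
print(rule.apply("id 1234-5678-9012-3456"))
```

The second call leaves the text untouched because the trigger string is absent, which shows the trade-off of a trigger: it saves regular expression work, but data that lacks the trigger string is not redacted.
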

The following redaction rules are preconfigured (not enabled) in Cloudera Manager. The order in which the rules are specified is relevant. For example, in the list of rules below, credit card numbers are redacted first, followed by SSNs, email addresses, and finally, hostnames.

  • Credit Card numbers (with separator)
    Search Expression: \d{4}[^\w]\d{4}[^\w]\d{4}[^\w]\d{4}
    Replace Expression: XXXX-XXXX-XXXX-XXXX
  • Social Security numbers (with separator)
    Search Expression: \d{3}[^\w]\d{2}[^\w]\d{4}
    Replace Expression: XXX-XX-XXXX
  • Email addresses
    Search Expression:
    \b([A-Za-z0-9]|[A-Za-z0-9][A-Za-z0-9\-\._] \
    *[A-Za-z0-9])@(([A-Za-z0-9]|[A-Za-z] \
    [A-Za-z0-9\-]*[A-Za-z0-9])\.)+([A-Za-z0-9] \
  • Hostnames
    Search Expression:
    \b(([A-Za-z]|[A-Za-z][A-Za-z0-9\-] \
    *[A-Za-z0-9])\.)+([A-Za-z0-9] \
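
Ordered rule application can be sketched with the credit card and SSN rules, whose expressions are given in full above. The sketch assumes that rules run one after another, each seeing the output of the previous rule; the RULES list and the redact function are hypothetical names, and Python's re module stands in for the actual redactor.

```python
import re

# (search, replace) pairs in the order they are applied:
# credit card numbers first, then Social Security numbers.
RULES = [
    (r"\d{4}[^\w]\d{4}[^\w]\d{4}[^\w]\d{4}", "XXXX-XXXX-XXXX-XXXX"),
    (r"\d{3}[^\w]\d{2}[^\w]\d{4}", "XXX-XX-XXXX"),
]

def redact(text: str) -> str:
    """Apply every rule in list order and return the redacted text."""
    for search, replace in RULES:
        text = re.sub(search, replace, text)
    return text

print(redact("ssn 123-45-6789, card 1234 5678 9012 3456"))
print(redact("nothing sensitive here"))
```

Because [^\w] matches any non-word character, the card number is redacted whether its groups are separated by spaces or hyphens. Text that matches no rule is returned unchanged.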

Enabling Log and Query Redaction Using Cloudera Manager

Cloudera recommends using the new layout in Cloudera Manager, instead of the classic layout, to enable redaction. The new layout allows you to add preconfigured redaction rules and test your rules inline. To enable log and query redaction in Cloudera Manager:
  1. Go to the HDFS service.
  2. Click the Configuration tab.
  3. In the Search box, type redaction to bring up the following redaction properties.
    • Enable Log and Query Redaction - Check this checkbox to enable log and query redaction for the cluster.
    • Log and Query Redaction Policy - List of rules for redacting sensitive information from log files and query strings. Choose a preconfigured rule or add a custom rule.

    Test your rules by entering sample text into the Test Redaction Rules text box and clicking Test Redaction. If no rules match, the text you entered is returned unchanged.

  4. Optionally, enter a reason for the configuration changes.
  5. Click Save Changes to commit the changes.
  6. Restart the cluster.

Configuring the Cloudera Navigator Data Management Component to Redact PII

You can specify credit card number patterns and other PII to be masked in audit events, in the properties of entities displayed in lineage diagrams, and in information retrieved from the Audit Server database and the Metadata Server persistent storage. Out of the box, this Cloudera Navigator property redacts only credit card numbers. To redact Social Security numbers or other PII, supply a different regular expression. Masking is not applied to audit events and lineage entities that existed before the mask was enabled.

Minimum Required Role: Navigator Administrator (also provided by Full Administrator)

  1. Do one of the following:
    • Select Clusters > Cloudera Management Service > Cloudera Management Service.
    • On the Status tab of the Home page, in the Cloudera Management Service table, click the Cloudera Management Service link.
  2. Click the Configuration tab.
  3. Expand the Navigator Audit Server Default Group category.
  4. Click the Advanced category.
  5. Configure the PII Masking Regular Expression property with a regular expression that matches the credit card number formats to be masked. The default expression is constructed from the following subexpressions:
    • Visa - (4[0-9]{12}(?:[0-9]{3})?)
    • MasterCard - (5[1-5][0-9]{14})
    • American Express - (3[47][0-9]{13})
    • Diners Club - (3(?:0[0-5]|[68][0-9])[0-9]{11})
    • Discover - (6(?:011|5[0-9]{2})[0-9]{12})
    • JCB - ((?:2131|1800|35\\d{3})\\d{11})
    If the property is left blank, PII is not masked.
  6. Click Save Changes to commit the changes.
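
A candidate masking expression can be checked offline before it is saved. The Python sketch below rests on several assumptions that the text above does not confirm: the listed subexpressions are joined with the alternation operator |, the doubled backslashes in the JCB subexpression are string escaping for a single \d, and the mask string [MASKED] is purely hypothetical. The sample numbers are widely published test card numbers, not real accounts.

```python
import re

# Subexpressions as listed above; \\d is written here as \d (assumption).
SUBEXPRESSIONS = [
    r"(4[0-9]{12}(?:[0-9]{3})?)",             # Visa
    r"(5[1-5][0-9]{14})",                     # MasterCard
    r"(3[47][0-9]{13})",                      # American Express
    r"(3(?:0[0-5]|[68][0-9])[0-9]{11})",      # Diners Club
    r"(6(?:011|5[0-9]{2})[0-9]{12})",         # Discover
    r"((?:2131|1800|35\d{3})\d{11})",         # JCB
]

# Assumption: the combined expression is a plain alternation.
PII_PATTERN = re.compile("|".join(SUBEXPRESSIONS))

def mask_pii(text: str, mask: str = "[MASKED]") -> str:
    """Replace every match of the combined pattern with a mask string."""
    return PII_PATTERN.sub(mask, text)

print(mask_pii("paid with 4111111111111111"))
print(mask_pii("amex 378282246310005 on file"))
print(mask_pii("order 12345 shipped"))
```

These subexpressions match unseparated digit runs only, so a number written with spaces or hyphens would need a different expression.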