Sensitive Data Redaction

Data redaction is the suppression of sensitive data, such as any personally identifiable information (PII). PII can be used on its own or with other information to identify or locate a single person, or to identify an individual in context. Enabling redaction allows you to transform PII to a pattern that does not contain any identifiable information. For example, you could replace all Social Security numbers (SSN) like 123-45-6789 with an unintelligible pattern like XXX-XX-XXXX, or replace only part of the SSN (XXX-XX-6789).
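
The full and partial SSN replacements described above can be illustrated with a short Python sketch. This is only an illustration of the idea, not Cloudera's redactor: the pattern and the function names mask_full and mask_partial are assumptions made for this example, and whether the product's Replace field accepts group references is not stated here.

```python
import re

# Illustrative pattern (an assumption for this sketch): three groups so the
# last four digits can be kept when masking only part of the number.
SSN = re.compile(r"(\d{3})-(\d{2})-(\d{4})")

def mask_full(text):
    """Replace every SSN-shaped match with a fixed, unintelligible pattern."""
    return SSN.sub("XXX-XX-XXXX", text)

def mask_partial(text):
    """Hide the first five digits but keep the last four (group 3)."""
    return SSN.sub(r"XXX-XX-\3", text)

print(mask_full("SSN 123-45-6789"))     # SSN XXX-XX-XXXX
print(mask_partial("SSN 123-45-6789"))  # SSN XXX-XX-6789
```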

Although encryption techniques are available to protect Hadoop data, the underlying problem with using encryption is that an admin who has complete access to the cluster also has access to unencrypted sensitive user data. Even users with appropriate ACLs on the data could have access to logs and queries where sensitive data might have leaked.

Data redaction provides compliance with industry regulations such as PCI and HIPAA, which require that access to PII be restricted to only those users whose jobs require such access. PII or other sensitive data must not be available through any other channels to users like cluster administrators or data analysts. Note, however, that redaction applies only to incidental leaks of data. For example, if a user already has the required permissions to access PII through queries, the query results are not redacted.

Password Redaction

Starting with Cloudera Manager and CDH 5.5, passwords will no longer be accessible in cleartext through the Cloudera Manager UI or in the configuration files stored on disk. For components such as HDFS, HBase, Hive, and so on, that use core Hadoop, the feature has been implemented by using Hadoop's CredentialProvider interface to encrypt and store passwords inside a secure creds.jceks keystore file. For components such as Hue and Impala, that do not use core Hadoop, instead of the password, we use a password_script = /path/to/script/that/will/emit/password.sh parameter that, when run, writes the password to stdout. Passwords contained within Cloudera Manager and Cloudera Navigator properties have been redacted internally in Cloudera Manager.
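
The password_script approach relies on a simple contract: the configured script is run, and whatever it writes to stdout is taken as the password. A minimal Python sketch of that contract follows. It is not how Hue or Impala implement the feature; the function name read_password and the stand-in command are hypothetical and exist only for this illustration.

```python
import subprocess
import sys

def read_password(script_argv):
    """Run a command and return its stdout, minus trailing newlines, as the password."""
    result = subprocess.run(
        script_argv,
        capture_output=True,  # collect stdout instead of printing it
        text=True,            # decode bytes to str
        check=True,           # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout.rstrip("\n")

# Stand-in for a real password script (hypothetical): a child Python process
# that prints a fixed value to stdout.
demo_argv = [sys.executable, "-c", "print('example-password')"]
password = read_password(demo_argv)
```

Because the password is produced only when the script runs, the configuration file itself holds a path rather than the secret.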

However, the database password contained in Cloudera Manager Server's /etc/cloudera-scm-server/db.properties file has not been redacted. The db.properties file is managed by customers and is populated manually when the Cloudera Manager Server database is being set up for the first time. Because this happens before the Cloudera Manager Server has started, encrypting the contents of this file is a different challenge from redacting configuration files.

Password redaction (not including log and query redaction) is enabled by default for deployments with Cloudera Manager 5.5 (or higher) managing CDH 5.5 (or higher). There are no user-visible controls to enable or disable this feature. It is expected to work out of the box. The primary places where you will encounter the effects of password redaction are:
  • In the Cloudera Manager Admin Console, on the Processes page for a given role instance, passwords in the linked configuration files have been replaced by *******.
  • On the Cloudera Manager Server and Agent hosts, all configuration files in the /var/run/cloudera-scm-agent/process directory will have their passwords replaced by *******.
Exceptions:
  • Solr does not have this feature enabled.
  • The database password contained in Cloudera Manager Server's /etc/cloudera-scm-server/db.properties file has not been redacted.

Scope - Log and Query Redaction

Data redaction in CDH targets sensitive SQL data and log files. Currently, you can enable or disable redaction for the whole cluster with a simple HDFS service-wide configuration change. Redaction is implemented with the assumption that sensitive information resides in the data itself, not the metadata. If you enable redaction for a file, only sensitive data inside the file is redacted. Metadata such as the name of the file or file owner is not redacted.

When data redaction is enabled, the following data is redacted:
  • Logs in HDFS and any dependent cluster services. Log redaction is not available in Isilon-based clusters.
  • Audit data sent to Cloudera Navigator.
  • SQL query strings displayed by Hue, Hive, and Impala.

Redaction Rules

Redaction is based on pattern matching. Use regular expressions to define redaction rules that search for patterns of sensitive information such as Social Security numbers, credit card numbers, and dates.

Use Cloudera Manager to create redaction rules that have the following components:

  • Search - A regular expression matched against the data. If the expression matches any part of the data, the match is replaced by the contents of the replace string. For example, to redact credit card numbers, your regular expression is \d{4}[^\w]\d{4}[^\w]\d{4}[^\w]\d{4}.
  • Replace - The string used to replace the redacted data. For example, to replace any matched credit card digits with Xs, the Replace string value would be XXXX-XXXX-XXXX-XXXX.
  • Trigger - An optional field that specifies a simple string to be searched for in the data. The redactor searches for matches to the Search regular expression only if the trigger string is found. If no trigger is specified, redaction occurs whenever the Search regular expression is matched. Using the Trigger field improves performance: simple string matching is faster than regular expression matching.
The following redaction rules are preconfigured (not enabled) in Cloudera Manager. The order in which the rules are specified is relevant. For example, in the list of rules below, credit card numbers are redacted first, followed by SSNs, email addresses, and finally, hostnames.
  • Credit Card numbers (with separator)
    • Search Expression - \d{4}[^\w]\d{4}[^\w]\d{4}[^\w]\d{4}
    • Replace Expression - XXXX-XXXX-XXXX-XXXX
  • Social Security numbers (with separator)
    • Search Expression - \d{3}[^\w]\d{2}[^\w]\d{4}
    • Replace Expression - XXX-XX-XXXX
  • Email addresses
    • Search Expression - \b([A-Za-z0-9]|[A-Za-z0-9][A-Za-z0-9\-\._]*[A-Za-z0-9])@(([A-Za-z0-9]|[A-Za-z][A-Za-z0-9\-]*[A-Za-z0-9])\.)+([A-Za-z0-9]|[A-Za-z0-9][A-Za-z0-9\-]*[A-Za-z0-9])\b
    • Replace Expression - email@redacted.host
  • Hostnames
    • Search Expression - \b(([A-Za-z]|[A-Za-z][A-Za-z0-9\-]*[A-Za-z0-9])\.)+([A-Za-z0-9]|[A-Za-z0-9][A-Za-z0-9\-]*[A-Za-z0-9])\b
    • Replace Expression - HOSTNAME.REDACTED
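
How the Search, Replace, and Trigger components interact, and why rule order matters, can be sketched in a few lines of Python. This is a simplified model, not the redactor that ships with CDH: the Rule type and the redact function are assumptions made for this example, and only the credit card and SSN rules are included for brevity. The "ssn" trigger in the second list is likewise hypothetical.

```python
import re
from typing import NamedTuple, Optional

class Rule(NamedTuple):
    search: str                    # regular expression matched against the data
    replace: str                   # string substituted for every match
    trigger: Optional[str] = None  # optional plain string that must be present

# The first two preconfigured rules, in the order they are listed.
PRECONFIGURED = [
    Rule(r"\d{4}[^\w]\d{4}[^\w]\d{4}[^\w]\d{4}", "XXXX-XXXX-XXXX-XXXX"),
    Rule(r"\d{3}[^\w]\d{2}[^\w]\d{4}", "XXX-XX-XXXX"),
]

def redact(text, rules):
    """Apply each rule in order; each rule sees the output of the previous one."""
    for rule in rules:
        # Cheap substring check first: skip the regex if the trigger is absent.
        if rule.trigger is not None and rule.trigger not in text:
            continue
        text = re.sub(rule.search, rule.replace, text)
    return text

sample = "card 1234 5678 9012 3456, ssn 123-45-6789"
print(redact(sample, PRECONFIGURED))
# card XXXX-XXXX-XXXX-XXXX, ssn XXX-XX-XXXX

# Hypothetical triggered rule: the SSN regex runs only if "ssn" appears.
TRIGGERED = [Rule(r"\d{3}[^\w]\d{2}[^\w]\d{4}", "XXX-XX-XXXX", trigger="ssn")]
print(redact("order 123-45-6789", TRIGGERED))  # unchanged: trigger not found
print(redact("ssn 123-45-6789", TRIGGERED))    # ssn XXX-XX-XXXX
```

Because each rule operates on the output of the one before it, a later rule can also match text produced by an earlier replacement, which is one reason the order of rules is relevant.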

Cloudera Manager API Redaction

Cloudera Manager API does not have redaction enabled by default. You can configure redaction of the sensitive items by specifying a JVM parameter for Cloudera Manager. When you set this parameter, API calls to Cloudera Manager for configuration data do not include the sensitive information. For more information, see Redacting Sensitive Information from the Exported Configuration.

Enabling Log and Query Redaction Using Cloudera Manager

Cloudera recommends using the new layout in Cloudera Manager, instead of the classic layout, to enable redaction. The new layout allows you to add preconfigured redaction rules and test your rules inline. To enable log and query redaction in Cloudera Manager:
  1. Go to the HDFS service.
  2. Click the Configuration tab.
  3. In the Search box, type redaction to bring up the following redaction properties.
    • Enable Log and Query Redaction - Check this checkbox to enable log and query redaction for the cluster.
    • Log and Query Redaction Policy - List of rules for redacting sensitive information from log files and query strings. Choose a preconfigured rule or add a custom rule.

    Test your rules by entering sample text into the Test Redaction Rules text box and clicking Test Redaction. If no rules match, the text you entered is returned unchanged.

  4. Optionally, enter a reason for the configuration changes.
  5. Click Save Changes to commit the changes.
  6. Restart the cluster.

Configuring the Cloudera Navigator Data Management Component to Redact PII

You can specify credit card number patterns and other PII to be masked in audit events, in the properties of entities displayed in lineage diagrams, and in information retrieved from the Audit Server database and the Metadata Server persistent storage. Out of the box, this Cloudera Navigator property redacts only credit card numbers. To redact Social Security numbers or other PII, supply a different regular expression. Masking is not applied to audit events and lineage entities that existed before the mask was enabled.

Minimum Required Role: Navigator Administrator (also provided by Full Administrator)

  1. Do one of the following:
    • Select Clusters > Cloudera Management Service > Cloudera Management Service.
    • On the Home > Status tab, in the Cloudera Management Service table, click the Cloudera Management Service link.
  2. Click the Configuration tab.
  3. Expand the Navigator Audit Server Default Group category.
  4. Click the Advanced category.
  5. Configure the PII Masking Regular Expression property with a regular expression that matches the credit card number formats to be masked. The default expression is:
    (4[0-9]{12}(?:[0-9]{3})?)|(5[1-5][0-9]{14})|(3[47][0-9]{13})|(3(?:0[0-5]|[68][0-9])[0-9]{11})|(6(?:011|5[0-9]{2})[0-9]{12})|((?:2131|1800|35\\d{3})\\d{11})
    which is constructed from the following subexpressions:
    • Visa - (4[0-9]{12}(?:[0-9]{3})?)
    • MasterCard - (5[1-5][0-9]{14})
    • American Express - (3[47][0-9]{13})
    • Diners Club - (3(?:0[0-5]|[68][0-9])[0-9]{11})
    • Discover - (6(?:011|5[0-9]{2})[0-9]{12})
    • JCB - ((?:2131|1800|35\\d{3})\\d{11})
    If the property is left blank, PII is not masked.
  6. Click Save Changes to commit the changes.
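
Before saving a custom expression, it can help to try it against sample values outside the product. The Python sketch below assembles the default expression from the subexpressions listed above and masks any match. Two assumptions are made here: the doubled backslashes in the property value (\\d) are treated as escaping for a single \d, and each match is replaced with a run of X characters of the same length. The replacement text that Cloudera Navigator actually writes is not stated in this section, and mask_cards is a hypothetical helper.

```python
import re

# Default expression, one subexpression per card brand, joined with "|".
# Assumption: "\\d" in the property value corresponds to a single "\d".
PII_MASK = re.compile(
    r"(4[0-9]{12}(?:[0-9]{3})?)"           # Visa
    r"|(5[1-5][0-9]{14})"                  # MasterCard
    r"|(3[47][0-9]{13})"                   # American Express
    r"|(3(?:0[0-5]|[68][0-9])[0-9]{11})"   # Diners Club
    r"|(6(?:011|5[0-9]{2})[0-9]{12})"      # Discover
    r"|((?:2131|1800|35\d{3})\d{11})"      # JCB
)

def mask_cards(text):
    """Replace each matched card number with X characters of equal length."""
    return PII_MASK.sub(lambda m: "X" * len(m.group(0)), text)

print(mask_cards("card 4111111111111111 ok"))  # card XXXXXXXXXXXXXXXX ok
print(mask_cards("amex 378282246310005"))      # amex XXXXXXXXXXXXXXX
```

Note that the expression matches unseparated digit runs only; a number written with spaces or dashes between groups would not match it.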