Sensitive Data Redaction
Data redaction is the suppression of sensitive data, such as any personally identifiable information (PII). PII can be used on its own or with other information to identify or locate a single person, or to identify an individual in context. Enabling redaction allows you to transform PII to a pattern that does not contain any identifiable information. For example, you could replace all Social Security numbers (SSN) like 123-45-6789 with an unintelligible pattern like XXX-XX-XXXX, or replace only part of the SSN (XXX-XX-6789).
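As a minimal sketch of the two replacement styles described above, the following Python snippet redacts a Social Security number either fully or partially. It is illustrative only and not part of any Cloudera tooling; the function names `redact_full` and `redact_partial` are hypothetical.

```python
import re

# Matches SSNs written as three digits, two digits, four digits,
# separated by hyphens (for example 123-45-6789).
SSN_PATTERN = re.compile(r"\b(\d{3})-(\d{2})-(\d{4})\b")


def redact_full(text: str) -> str:
    """Replace every SSN with a pattern that carries no identifying digits."""
    return SSN_PATTERN.sub("XXX-XX-XXXX", text)


def redact_partial(text: str) -> str:
    """Replace all but the last four digits of every SSN."""
    return SSN_PATTERN.sub(r"XXX-XX-\3", text)


if __name__ == "__main__":
    sample = "Customer SSN is 123-45-6789."
    print(redact_full(sample))
    print(redact_partial(sample))
```

Partial redaction keeps the last four digits available for tasks such as identity confirmation, while full redaction removes every digit.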
Although encryption techniques are available to protect Hadoop data, the underlying problem with using encryption is that an administrator who has complete access to the cluster also has access to unencrypted sensitive user data. Even users with appropriate ACLs on the data could have access to logs and queries where sensitive data might have leaked.
Data redaction helps you comply with industry regulations such as PCI and HIPAA, which require that access to PII be restricted to only those users whose jobs require such access. PII or other sensitive data must not be available through any other channels to users like cluster administrators or data analysts. Note, however, that redaction applies only to incidental leaks of data, such as through log files and query strings. For example, if a user already has the required permissions to access PII through queries, then query results will not be redacted.
Password Redaction
Starting with Cloudera Manager and CDH 5.5, passwords will no longer be accessible in cleartext through the Cloudera Manager Admin Console or in the configuration files stored on disk. For components that use core Hadoop such as HDFS, HBase, and Hive, Cloudera Manager Server uses Hadoop's CredentialProvider interface to encrypt and store passwords inside a secure creds.jceks keystore file. For components that do not use core Hadoop, such as Hue and Impala, instead of the password, Cloudera Manager Server uses a password_script = /path/to/script/that/will/emit/password.sh parameter that, when run, writes the password to stdout. Passwords contained within Cloudera Manager and Cloudera Navigator properties have been redacted internally in Cloudera Manager.
However, the database password contained in Cloudera Manager Server's /etc/cloudera-scm-server/db.properties file has not been redacted. The db.properties file is managed by customers and is populated manually when the Cloudera Manager Server database is being set up for the first time. Since this occurs before the Cloudera Manager Server has even started, encrypting the contents of this file is a completely different challenge as compared to that of redacting configuration files.
- In the Cloudera Manager Admin Console, on the Processes page for a given role instance, passwords in the linked configuration files have been replaced by *******.
- On the Cloudera Manager Server and Agent hosts, all configuration files in the /var/run/cloudera-scm-agent/process directory will have their passwords replaced by *******.
- In the Cloudera Manager Admin Console, Advanced Configuration Snippet parameters will be redacted to block sensitive information such as passwords or secret keys. Users who have the permission to edit the parameter will still see the sensitive words, but read-only users without edit privileges will only see the redacted version.
Redaction of Advanced Configuration Snippet parameters is based on detecting keywords explicitly defined as sensitive in the contents of these parameters. That is, parameters containing the keywords password, key, aws, or secret, will be redacted for users who do not have the required edit privileges. Default values for sensitive fields are not redacted since defaults are published in the public documentation. Default passwords pose a security risk and should not be used in production.
A limitation of this feature is that the list of keywords used to determine sensitive information is currently limited to those listed above and is not configurable using the Cloudera Manager Admin Console.
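The keyword-based behavior described above can be sketched roughly as follows. This is not Cloudera Manager's implementation: it assumes the keywords are matched as case-insensitive substrings of the property name, it reuses the ******* mask shown for configuration files, and the names `is_sensitive` and `visible_properties` are hypothetical.

```python
# Keywords listed in the text above as marking a parameter as sensitive.
SENSITIVE_KEYWORDS = ("password", "key", "aws", "secret")

# Assumed mask; the source shows ******* for passwords in configuration files.
REDACTED = "*******"


def is_sensitive(name: str) -> bool:
    """Return True if the property name contains any sensitive keyword."""
    lowered = name.lower()
    return any(keyword in lowered for keyword in SENSITIVE_KEYWORDS)


def visible_properties(props: dict, can_edit: bool) -> dict:
    """Return the properties as a given user would see them.

    Users with edit privileges see real values; read-only users see
    the mask in place of any value whose name looks sensitive.
    """
    if can_edit:
        return dict(props)
    return {
        name: (REDACTED if is_sensitive(name) else value)
        for name, value in props.items()
    }


if __name__ == "__main__":
    snippet = {"fs.s3a.secret.key": "abc123", "dfs.replication": "3"}
    print(visible_properties(snippet, can_edit=False))
    print(visible_properties(snippet, can_edit=True))
```

Because the keyword list is fixed, a sensitive property whose name contains none of these keywords would pass through unredacted in this sketch, which mirrors the limitation noted above.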
Log and Query Redaction - Scope and Rules
Data redaction in CDH targets sensitive SQL data and log files. You can enable or disable redaction for the whole cluster with a simple HDFS service-wide configuration change. Redaction is implemented with the assumption that sensitive information resides in the data itself, not the metadata. If you enable redaction for a file, only sensitive data inside the file is redacted. Metadata such as the name of the file or file owner is not redacted.
Redaction applies to the following:
- Logs in HDFS and any dependent cluster services. Log redaction is not available in Isilon-based clusters.
- Audit data sent to Cloudera Navigator.
- SQL query strings displayed by Hue, Hive, and Impala.
Redaction is based on pattern matching. Use regular expressions to define redaction rules that search for patterns of sensitive information such as Social Security numbers, credit card numbers, and dates.
Use Cloudera Manager to create redaction rules that have the following components:
- Search - A regular expression matched against the data. If the expression matches any part of the data, the match is replaced by the contents of the replace string. For example, to redact credit card numbers, your regular expression is \d{4}[^\w]\d{4}[^\w]\d{4}[^\w]\d{4}.
- Replace - The string used to replace the redacted data. For example, to replace any matched credit card digits with Xs, the Replace string value would be XXXX-XXXX-XXXX-XXXX.
- Trigger - An optional field that specifies a simple string to be searched for in the data. The redactor searches for matches to the Search regular expression only if the trigger string is found. If no trigger is specified, redaction occurs whenever the Search regular expression is matched. Using the Trigger field improves performance: simple string matching is faster than regular expression matching.
Redaction Rule | Search Expression | Replace Expression |
---|---|---|
Credit Card numbers (with separator) | `\d{4}[^\w]\d{4}[^\w]\d{4}[^\w]\d{4}` | XXXX-XXXX-XXXX-XXXX |
Social Security numbers (with separator) | `\d{3}[^\w]\d{2}[^\w]\d{4}` | XXX-XX-XXXX |
Email addresses | `\b([A-Za-z0-9]\|[A-Za-z0-9][A-Za-z0-9\-\._]*[A-Za-z0-9])@(([A-Za-z0-9]\|[A-Za-z][A-Za-z0-9\-]*[A-Za-z0-9])\.)+([A-Za-z0-9]\|[A-Za-z0-9][A-Za-z0-9\-]*[A-Za-z0-9])\b` | email@redacted.host |
Hostnames | `\b(([A-Za-z]\|[A-Za-z][A-Za-z0-9\-]*[A-Za-z0-9])\.)+([A-Za-z0-9]\|[A-Za-z0-9][A-Za-z0-9\-]*[A-Za-z0-9])\b` | HOSTNAME.REDACTED |
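To make the interaction of Search, Replace, and Trigger concrete, here is a small Python sketch of a rule evaluator. It is not the redactor used by CDH: the `RedactionRule` class and `redact` function are hypothetical, the trigger is assumed to be a case-sensitive substring check, and the trigger value `ssn` is invented for illustration. The Search and Replace values are taken from the credit card and Social Security rows above.

```python
import re
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class RedactionRule:
    search: str                    # regular expression matched against the data
    replace: str                   # string substituted for each match
    trigger: Optional[str] = None  # optional plain string checked first

    def apply(self, text: str) -> str:
        # Cheap substring test first: skip the regex entirely when a
        # trigger is configured and does not occur in the text.
        if self.trigger is not None and self.trigger not in text:
            return text
        return re.sub(self.search, self.replace, text)


RULES = (
    RedactionRule(r"\d{4}[^\w]\d{4}[^\w]\d{4}[^\w]\d{4}", "XXXX-XXXX-XXXX-XXXX"),
    # Hypothetical trigger: only look for SSNs when the text mentions "ssn".
    RedactionRule(r"\d{3}[^\w]\d{2}[^\w]\d{4}", "XXX-XX-XXXX", trigger="ssn"),
)


def redact(text: str, rules: Sequence[RedactionRule] = RULES) -> str:
    """Apply each rule in order; text that matches no rule is returned unchanged."""
    for rule in rules:
        text = rule.apply(text)
    return text


if __name__ == "__main__":
    print(redact("card 1234-5678-9012-3456"))
    print(redact("ssn 123-45-6789"))
    print(redact("id 123-45-6789"))  # no trigger string, so left as-is
```

The last call shows the trade-off of a trigger: it avoids regular expression work on most input, but data that matches the Search expression without containing the trigger string is not redacted.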
Cloudera Manager API Redaction
The Cloudera Manager API does not have redaction enabled by default. You can configure redaction of sensitive items by specifying a JVM parameter for Cloudera Manager. When you set this parameter, API calls to Cloudera Manager for configuration data do not include the sensitive information. For more information, see Redacting Sensitive Information from the Exported Configuration.
Enabling Log and Query Redaction Using Cloudera Manager
- Go to the HDFS service.
- Click the Configuration tab.
- In the Search box, type redaction to bring up the following redaction properties.
  Property | Description |
  ---|---|
  Enable Log and Query Redaction | Check this checkbox to enable log and query redaction for the cluster. |
  Log and Query Redaction Policy | List of rules for redacting sensitive information from log files and query strings. Choose a preconfigured rule or add a custom rule. Test your rules by entering sample text into the Test Redaction Rules text box and clicking Test Redaction. If no rules match, the text you entered is returned unchanged. |
- Optionally, enter a reason for the configuration changes.
- Click Save Changes to commit the changes.
- Restart the cluster.
Configuring the Cloudera Navigator Data Management Component to Redact PII
You can specify credit card number patterns and other PII to be masked in audit events, in the properties of entities displayed in lineage diagrams, and in information retrieved from the Audit Server database and the Metadata Server persistent storage. Out of the box, this Cloudera Navigator property redacts only credit card numbers. To redact Social Security numbers or other PII, supply a different regular expression. Masking is not applied to audit events and lineage entities that existed before the mask was enabled.
Minimum Required Role: Navigator Administrator (also provided by Full Administrator)
- Do one of the following:
  - Select .
  - In the Cloudera Management Service table, click the Cloudera Management Service link.
- Click the Configuration tab.
- Expand the Navigator Audit Server Default Group category.
- Click the Advanced category.
- Configure the PII Masking Regular Expression property with a regular expression that matches the credit card number formats to be masked. The default expression is:
(4[0-9]{12}(?:[0-9]{3})?)|(5[1-5][0-9]{14})|(3[47][0-9]{13})|(3(?:0[0-5]|[68][0-9])[0-9]{11})|(6(?:011|5[0-9]{2})[0-9]{12})|((?:2131|1800|35\\d{3})\\d{11})
which is constructed from the following subexpressions:
- Visa - (4[0-9]{12}(?:[0-9]{3})?)
- MasterCard - (5[1-5][0-9]{14})
- American Express - (3[47][0-9]{13})
- Diners Club - (3(?:0[0-5]|[68][0-9])[0-9]{11})
- Discover - (6(?:011|5[0-9]{2})[0-9]{12})
- JCB - ((?:2131|1800|35\\d{3})\\d{11})
- Click Save Changes to commit the changes.
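Before saving a custom expression, it can help to check it against sample values. The Python sketch below rebuilds the default expression from the subexpressions listed above and applies it to test strings. Two points are assumptions rather than documented behavior: the doubled backslashes (\\d) in the property value are treated as an escaping layer, so a single \d is used in the Python raw strings, and matches are replaced with X characters of the same length because the actual mask applied by Cloudera Navigator is not stated here. The names `CARD_PATTERNS` and `mask_pii` are hypothetical.

```python
import re

# Subexpressions copied from the list above (with \\d written as \d,
# on the assumption that the doubling is only an escaping layer).
CARD_PATTERNS = {
    "Visa": r"(4[0-9]{12}(?:[0-9]{3})?)",
    "MasterCard": r"(5[1-5][0-9]{14})",
    "American Express": r"(3[47][0-9]{13})",
    "Diners Club": r"(3(?:0[0-5]|[68][0-9])[0-9]{11})",
    "Discover": r"(6(?:011|5[0-9]{2})[0-9]{12})",
    "JCB": r"((?:2131|1800|35\d{3})\d{11})",
}

# The default expression is the alternation of all subexpressions.
PII_MASK = re.compile("|".join(CARD_PATTERNS.values()))


def mask_pii(text: str, mask_char: str = "X") -> str:
    """Replace each match with mask characters of the same length (assumed style)."""
    return PII_MASK.sub(lambda m: mask_char * len(m.group(0)), text)


if __name__ == "__main__":
    # Well-known test card numbers, not real accounts.
    for sample in ("paid with 4111111111111111 today",
                   "amex 378282246310005",
                   "no card here"):
        print(mask_pii(sample))
```

Because the expression contains no anchors or word boundaries, it also matches card-like digit runs embedded in longer numbers, which is worth keeping in mind when adapting it for other PII.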