Data Lake security

Data Lake security and governance is managed by a shared set of services referred to as a Data Lake cluster.

Data Lake cluster services

A Data Lake cluster is managed by Cloudera Manager, and includes the following services:

  • Hive MetaStore (HMS) -- table metadata
  • Apache Ranger -- fine-grained authorization policies, auditing
  • Apache Atlas -- metadata management and governance: lineage, analytics, attributes
  • Apache Knox:
    • Authenticating Proxy for Web UIs and HTTP APIs -- SSO
    • IDBroker -- identity federation; cloud credentials

Currently there is one Data Lake cluster for each CDP environment. Security in all DataHub clusters created in a Data Lake is managed by these shared security and governance services.

Links to the Atlas and Ranger web UIs are provided on each DataLake home page. A link to the Data Lake cluster Cloudera Manager instance provides access to Data Lake cluster settings.

Apache Ranger

Apache Ranger manages access control through a user interface that ensures consistent policy administration across Data Lake components and DataHub clusters.

Security administrators can define security policies at the database, table, column, and file levels, and can administer permissions for groups or individual users. Rules based on dynamic conditions such as time or geolocation can also be added to an existing policy rule. Ranger security zones enable you to organize service resources into multiple security zones.

Ranger also provides a centralized framework for collecting access audit history and reporting data, including filtering on various parameters.

Apache Atlas

Apache Atlas provides a set of metadata management and governance services that enable you to manage data lake and DataHub cluster assets.

  • Search and Proscriptive Lineage – facilitates pre-defined and ad hoc exploration of data and metadata, while maintaining a history of data sources and how specific data was generated.
  • Ranger plugin for metadata-driven data access control.
  • Flexible modeling of both business and operational data.
  • Data Classification – helps you understand the nature of the data within Hadoop and classify it based on external and internal sources.

Apache Knox

Knox SSO provides web UI SSO (Single Sign-on) capabilities to Data Lakes and associated environments. Knox SSO enables users to log in once and gain access to Data Lake and DataHub cluster resources.

Knox IDBroker is an identity federation solution that provides temporary cloud credentials in exchange for various tokens or authentication.