Data Sharing in Cloudera Data Catalog overview

Data Sharing in Cloudera Data Catalog enables secure, self-service access to Iceberg tables for external users through the Cloudera Iceberg REST Catalog. Data sharing is set up and managed from the Cloudera Data Catalog user interface.

Cloudera Data Catalog provides a secure, self-service way to give external users read-only access to Iceberg tables. It builds on the Cloudera Iceberg REST Catalog, so the tables can be read from compute engines that support the Iceberg REST API, such as Snowflake, Databricks, Amazon EMR, Amazon Athena, and Amazon Redshift. Because Cloudera Data Sharing does not duplicate data to external storage, it enhances security and reduces storage costs and processing time, especially for dynamic tables. For example, a manufacturer (the Data Provider) can share manufacturing and inventory data with resellers or suppliers (the Data Consumers). Because the data is shared in the Iceberg format, ETL processes can be avoided regardless of the cloud provider or data ecosystem. Additionally, the Cloudera Iceberg REST Catalog uses time-bound Knox credentials (Client ID and Secret) and access tokens to secure Data Sharing. Ranger audit reports for Data Share updates and access events are shown in Cloudera Data Catalog.

Cloudera Data Catalog greatly simplifies Data Sharing by providing a centralized user interface and asset discovery, which is particularly useful when setting up Data Shares. It also lets you add metadata in the form of keywords (to Data Shares) and Atlas classifications (to assets) to improve searchability. The user interface also provides a quick overview of audit reports for Data Share updates and access events.

Who are Data Providers?

Data Providers are Cloudera users who are designated to share data stored in Cloudera with external users or across organizations. They are usually data owners or data stewards. Data Providers select the Iceberg tables to be shared, generate secure, time-limited credentials (Client ID and Secret) for users, and control how long the data can be accessed.

Data Providers must have the DataShareAdmin resource role, which is limited to a specific cloud environment.

An environment can have more than one Data Provider. Each of them has a federated view of all Data Shares across Data Lakes.

Who are Data Consumers?

Data Providers designate external users as Data Consumers by setting up Knox Client IDs and Secrets for them. The Data Provider must share these credentials with the Data Consumers, who then enter them into their Iceberg REST Catalog compatible compute engines. For example, Data Consumers can be members of a supply chain department or a third-party vendor who consume the manufacturing data to optimize their own operations.
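As an illustration of what entering the credentials into a compute engine can look like, the following minimal Python sketch assembles the connection properties that a generic Iceberg REST client typically expects. The property names (type, uri, credential) follow the PyIceberg REST catalog configuration convention. The function name, the endpoint URL, and the credential values are hypothetical placeholders, and a real Cloudera endpoint may require additional properties that are not shown here.

```python
from __future__ import annotations


def rest_catalog_properties(uri: str, client_id: str, client_secret: str) -> dict[str, str]:
    """Assemble connection properties for a generic Iceberg REST client (illustrative only)."""
    if not client_id or not client_secret:
        raise ValueError("both the Client ID and the Secret are required")
    return {
        "type": "rest",
        # Drop a trailing slash so that clients can append API paths cleanly.
        "uri": uri.rstrip("/"),
        # Assumption: OAuth2 client credentials are passed as "id:secret",
        # which is the convention PyIceberg uses for REST catalogs.
        "credential": f"{client_id}:{client_secret}",
    }


# Placeholder values; the real endpoint and credentials come from the Data Provider.
props = rest_catalog_properties(
    "https://example.invalid/iceberg-rest/",
    "my-client-id",
    "my-client-secret",
)
print(sorted(props))
```

If PyIceberg is installed, a dictionary like this could be passed on as load_catalog("shared", **props). Other engines, such as Snowflake or Amazon Athena, take the same three pieces of information through their own configuration screens or SQL statements.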

How does Data Sharing work?

After Data Providers set up Data Consumers by creating Knox credentials (Client ID and Secret), they create Data Shares, which are logical groups of Iceberg tables, using the Cloudera Data Catalog user interface or CDP CLI. During Data Share creation, they assign the assets to be shared and the Data Consumers. Behind the scenes, Knox authorizes the users and manages the lifecycle of the credentials (expiration, revocation, and regeneration). It also routes the incoming read and authentication requests.
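The credential lifecycle described above (expiration, revocation, and regeneration) can be pictured with a small conceptual model. This is only a sketch of the idea and not the Knox implementation or API. The class and function names are hypothetical, and the assumption that regeneration keeps the Client ID while replacing the Secret is made purely for illustration.

```python
from __future__ import annotations

import secrets
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class ClientCredential:
    """Hypothetical stand-in for a time-bound Client ID and Secret pair."""
    client_id: str
    client_secret: str
    expires_at: datetime
    revoked: bool = False

    def is_valid(self, now: datetime) -> bool:
        # Usable only while it is neither revoked nor expired.
        return not self.revoked and now < self.expires_at


def issue(client_id: str, lifetime: timedelta, now: datetime) -> ClientCredential:
    """Create a credential that expires after the given lifetime."""
    return ClientCredential(client_id, secrets.token_urlsafe(32), now + lifetime)


def regenerate(old: ClientCredential, lifetime: timedelta, now: datetime) -> ClientCredential:
    """Revoke the old credential and issue a new secret for the same Client ID (assumption)."""
    old.revoked = True
    return issue(old.client_id, lifetime, now)


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
cred = issue("reseller-a", timedelta(days=30), now)
valid_at_start = cred.is_valid(now)                            # inside the lifetime
valid_after_expiry = cred.is_valid(now + timedelta(days=31))   # past the expiration
new_cred = regenerate(cred, timedelta(days=30), now)
old_valid_after_regen = cred.is_valid(now)                     # revoked by regeneration
```

The point of the model is that a Data Consumer's access ends without any action on the consumer side: either the lifetime chosen by the Data Provider runs out, or the Data Provider revokes or regenerates the credential.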

Ranger creates the groups and the read-only policies that give the authenticated external users access to the designated Iceberg tables. Ranger also handles asset and access auditing.
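The effect of a read-only policy on a Data Share, together with access auditing, can be sketched as a toy model. This is a conceptual illustration and not how Ranger stores or evaluates policies. The DataShare class, the check_access function, and the table and consumer names are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class DataShare:
    """Hypothetical model of a Data Share: a named group of tables and consumers."""
    name: str
    tables: frozenset[str]
    consumers: frozenset[str]


# Every decision is recorded, whether it was allowed or denied.
audit_log: list[tuple[str, str, str, bool]] = []


def check_access(share: DataShare, consumer: str, table: str, action: str) -> bool:
    """Allow only reads, only for assigned consumers, only on shared tables."""
    allowed = (
        action == "read"
        and consumer in share.consumers
        and table in share.tables
    )
    audit_log.append((consumer, table, action, allowed))
    return allowed


share = DataShare(
    name="inventory_share",
    tables=frozenset({"mfg.inventory", "mfg.production_runs"}),
    consumers=frozenset({"reseller-a"}),
)

read_ok = check_access(share, "reseller-a", "mfg.inventory", "read")
write_ok = check_access(share, "reseller-a", "mfg.inventory", "write")
other_table_ok = check_access(share, "reseller-a", "hr.salaries", "read")
other_user_ok = check_access(share, "reseller-b", "mfg.inventory", "read")
```

In this model, only the first request succeeds. The write attempt, the request for a table outside the Data Share, and the request from an unassigned consumer are all denied, and all four decisions end up in the audit log.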

Hive provides the storage layer and serves the read requests within the framework of the Iceberg REST Catalog API.