Minimal setup for cloud storage

This minimal secure setup uses one ADLS Gen2 storage account with one file system for each Data Lake and multiple managed identities, where each managed identity has at least one role assigned.

The example setup includes:

  • One ADLS Gen2 storage account with two containers within it (one for Storage Location Base and another for Logs Location Base):
    • One ADLS Gen2 container is required for use as the Storage Location Base, such as abfs://storage-fs@my-datalake.dfs.core.windows.net, where my-datalake is your storage account name and storage-fs is your container name.

      As a best practice, Cloudera recommends that all storage accounts be in the same subscription.

      CDP will automatically create the following locations for Ranger audits:

      • The location for Ranger audits will be created automatically based on what you specify during environment creation as Storage Location Base and the suffix /ranger/audit. In our example setup, this location will be at abfs://storage-fs@my-datalake.dfs.core.windows.net/ranger/audit
    • One ADLS Gen2 container is required for use as the Logs Location Base, such as abfs://logs-fs@my-datalake.dfs.core.windows.net, where my-datalake is your storage account name and logs-fs is your container name. CDP will automatically create the following locations for log storage (including FreeIPA logs):
      • The location for log storage will be created automatically based on what you specify during environment creation as Logs Location Base and the suffix /cluster-logs. In our example setup, this location will be at abfs://logs-fs@my-datalake.dfs.core.windows.net/cluster-logs
      • The location for FreeIPA logs will be created automatically based on what you specify during environment creation as Logs Location Base and the suffix /cluster-logs/freeipa. In our example setup, this location will be at abfs://logs-fs@my-datalake.dfs.core.windows.net/cluster-logs/freeipa
  • Four managed identities: The IDBroker component of CDP uses user-assigned managed identities to control access to ADLS Gen2. It also stores and manages the mappings between services/users and their corresponding managed identities. The following managed identities must be created:
    • Assumer identity
      • Description: During Data Lake cluster creation, CDP will attach this identity to the IDBroker VM. IDBroker will then use it to attach the other managed identities to the IDBroker VM. Once these identities are attached to the VM, IDBroker can acquire access tokens for them, which eliminates the need to store credentials in the application.
      • Steps: Create a managed identity and then assign the Virtual Machine Contributor and Managed Identity Operator roles to this managed identity on the scope of the subscription. This is required because CDP currently does not support deploying into an existing resource group.
    • Data Lake Admin identity
      • Description: This managed identity will be used for CDP services to access data.
      • Steps: Create a managed identity and then assign the Storage Blob Data Owner role to this managed identity on the scope of the two containers (Storage Location Base and Logs Location Base) created for CDP.
    • Ranger Audit Logger identity
      • Description: This managed identity will be used by Ranger to write audits.
      • Steps: Create a managed identity and then assign the Storage Blob Data Contributor role to this managed identity on the scope of the Storage Location Base container created for CDP.
    • Logger identity
      • Description: This managed identity will be used by CDP to write telemetry logs.
      • Steps: Create a managed identity and then assign the Storage Blob Data Contributor role to this managed identity on the scope of the Logs Location Base container created for CDP.
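
The locations and role assignments listed above can be summarized in a short sketch. The Python below is illustrative only: it does not call any Azure or CDP API, and the names abfs_uri, derived_locations, RoleAssignment, and ROLE_ASSIGNMENTS are hypothetical helpers introduced for this example. It assumes the example account name my-datalake and the container names storage-fs and logs-fs used in this section.

```python
from dataclasses import dataclass
from typing import Dict, List

# Example values from this section; replace them with your own names.
STORAGE_ACCOUNT = "my-datalake"
STORAGE_CONTAINER = "storage-fs"   # Storage Location Base container
LOGS_CONTAINER = "logs-fs"         # Logs Location Base container


def abfs_uri(container: str, account: str, suffix: str = "") -> str:
    """Build an abfs:// URI for a container, with an optional path suffix."""
    return f"abfs://{container}@{account}.dfs.core.windows.net{suffix}"


def derived_locations(account: str, storage_container: str,
                      logs_container: str) -> Dict[str, str]:
    """Return the base locations plus the suffixed locations described above."""
    return {
        "storage_location_base": abfs_uri(storage_container, account),
        "logs_location_base": abfs_uri(logs_container, account),
        "ranger_audit": abfs_uri(storage_container, account, "/ranger/audit"),
        "cluster_logs": abfs_uri(logs_container, account, "/cluster-logs"),
        "freeipa_logs": abfs_uri(logs_container, account, "/cluster-logs/freeipa"),
    }


@dataclass(frozen=True)
class RoleAssignment:
    identity: str  # managed identity, as named in this section
    role: str      # Azure built-in role name
    scope: str     # where the role is assigned


# One entry per (identity, role, scope) combination listed in this section.
ROLE_ASSIGNMENTS: List[RoleAssignment] = [
    RoleAssignment("Assumer", "Virtual Machine Contributor", "subscription"),
    RoleAssignment("Assumer", "Managed Identity Operator", "subscription"),
    RoleAssignment("Data Lake Admin", "Storage Blob Data Owner",
                   f"container {STORAGE_CONTAINER}"),
    RoleAssignment("Data Lake Admin", "Storage Blob Data Owner",
                   f"container {LOGS_CONTAINER}"),
    RoleAssignment("Ranger Audit Logger", "Storage Blob Data Contributor",
                   f"container {STORAGE_CONTAINER}"),
    RoleAssignment("Logger", "Storage Blob Data Contributor",
                   f"container {LOGS_CONTAINER}"),
]


if __name__ == "__main__":
    locations = derived_locations(STORAGE_ACCOUNT, STORAGE_CONTAINER, LOGS_CONTAINER)
    for name, uri in locations.items():
        print(f"{name}: {uri}")
    for a in ROLE_ASSIGNMENTS:
        print(f"{a.identity}: {a.role} on {a.scope}")
```

Running the script prints each location and each role assignment, which can serve as a checklist to compare against what you configure in Azure before registering the environment.
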
The following diagram illustrates the required setup: