Onboarding CDP users and groups for Azure cloud storage (No RAZ)

The minimal setup defined earlier spins up a CDP environment and Data Lake with no end user access to cloud storage. Adding users and groups to a CDP environment involves ensuring that they are properly mapped to managed identities to access cloud storage.

In general, to onboard new users or groups, you need to set up the following in Azure:

  1. Create two more containers within the storage account (mydatalake) created earlier, one for data engineers (for example, dataeng) and one for data scientists (for example, datascience).
  2. Create two more managed identities, one for data engineers (for example, data-eng-mi) and one for data scientists (for example, data-science-mi), and assign each the Storage Blob Data Owner role on the scope of one of the two newly created containers. The data-eng-mi managed identity needs the Storage Blob Data Owner role on the scope of the dataeng container, and the data-science-mi managed identity needs the Storage Blob Data Owner role on the scope of the datascience container.
  3. Grant the Data Lake Admin managed identity created earlier the Storage Blob Data Owner role on the scope of these two newly created containers.

The end goal is the following setup, which builds on the minimal setup presented earlier.

The first diagram illustrates the scenario where Backup Location Base is in the same location as the Logs Location Base:



The second diagram illustrates the scenario where Backup Location Base and Logs Location Base are separate:



The following documentation provides detailed steps for how to create this setup. The steps involve:
  1. Creating additional containers
  2. Creating additional managed identities
  3. Creating mappings in CDP
  4. Updating the Data Lake Admin managed identity

Creating additional containers

Within the ADLS Gen 2 storage account (mydatalake) created earlier, create two more containers (dataeng and datascience).

To create a container, perform the following steps:

  1. On Azure Portal, navigate to Storage accounts > your storage account > Containers > +Container.
  2. Provide a name for your container and click OK.

Repeat these steps to create both containers.
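
The same containers can also be created from the command line. The following is a minimal sketch using Azure CLI, not a procedure taken from this guide. It assumes that the az CLI is installed, that you are logged in with permission to create containers, and that the storage account is named mydatalake as in the examples above. The run helper and the DRY_RUN variable are hypothetical conveniences introduced here: by default the script only prints the commands it would execute.

```shell
# Sketch only: create the two containers with Azure CLI instead of the portal.
STORAGE_ACCOUNT="${STORAGE_ACCOUNT:-mydatalake}"  # example account name from this document
DRY_RUN="${DRY_RUN:-1}"                            # 1 = print commands only, 0 = run them

# Print a command; execute it only when DRY_RUN=0.
run() {
  printf '+ %s\n' "$*"
  if [ "$DRY_RUN" = "0" ]; then
    "$@"
  fi
}

# Create one ADLS Gen2 container (file system) per name given.
create_containers() {
  for name in "$@"; do
    run az storage fs create --name "$name" \
      --account-name "$STORAGE_ACCOUNT" --auth-mode login
  done
}

create_containers dataeng datascience
```

Run the script with DRY_RUN=0 to execute the commands once the printed output looks right.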

Creating additional managed identities

After creating the two additional containers, create two additional managed identities, one for data engineers (data-eng-mi) and one for data scientists (data-science-mi).

To create the two new managed identities, perform the following steps:

  1. On Azure Portal, navigate to Managed Identities.
  2. Click +Add.
  3. Specify a name for the managed identity and select the resource group that you created earlier.
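
As an alternative to the portal, the identities can be created with Azure CLI. This is a hedged sketch: the resource group name my-cdp-rg is a hypothetical placeholder for the resource group you created earlier, and the run helper with its DRY_RUN switch is an illustration device that prints the commands unless DRY_RUN=0 is set.

```shell
# Sketch only: create the two user-assigned managed identities with Azure CLI.
RESOURCE_GROUP="${RESOURCE_GROUP:-my-cdp-rg}"  # hypothetical name; use your own resource group
DRY_RUN="${DRY_RUN:-1}"                         # 1 = print commands only, 0 = run them

# Print a command; execute it only when DRY_RUN=0.
run() {
  printf '+ %s\n' "$*"
  if [ "$DRY_RUN" = "0" ]; then
    "$@"
  fi
}

# Create one user-assigned managed identity per name given.
create_identities() {
  for name in "$@"; do
    run az identity create --name "$name" --resource-group "$RESOURCE_GROUP"
  done
}

create_identities data-eng-mi data-science-mi
```

When executed for real, az identity create returns JSON that includes the identity's id (the resource ID used later in the IDBroker mappings) and its principalId (usable when assigning roles from the command line).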

Repeat these steps to create each of the two managed identities. Once you’ve created these managed identities, assign each identity a role scoped to its own container (data-eng-mi to dataeng and data-science-mi to datascience) as follows:

  1. Navigate to Storage accounts > your storage account > Containers > your container > Access Control (IAM).
  2. Click +Add > Add role assignment.
  3. Under Add role assignment:
    1. Under Role, select Storage Blob Data Owner.
    2. Under Assign access to, select User assigned managed identity.
    3. Under Select, select the managed identity.
    4. Click Save.

After performing these steps for each of the two managed identities, you should have the required managed identities created and their roles assigned on the correct scope.
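
The container-scoped role assignments can also be sketched with Azure CLI. The example below is illustrative rather than authoritative. The subscription ID, resource group, and principal IDs are placeholders that you must replace, and container_scope, assign_owner, and run are hypothetical helper names. The principal ID of an identity can be looked up with az identity show --name <identity> --resource-group <group> --query principalId -o tsv.

```shell
# Sketch only: assign Storage Blob Data Owner on a single container's scope.
SUBSCRIPTION_ID="${SUBSCRIPTION_ID:-<subscription-id>}"        # placeholder
RESOURCE_GROUP="${RESOURCE_GROUP:-my-cdp-rg}"                  # hypothetical name
STORAGE_ACCOUNT="${STORAGE_ACCOUNT:-mydatalake}"               # example name from this document
DATA_ENG_PRINCIPAL_ID="${DATA_ENG_PRINCIPAL_ID:-<data-eng-mi-principal-id>}"
DATA_SCIENCE_PRINCIPAL_ID="${DATA_SCIENCE_PRINCIPAL_ID:-<data-science-mi-principal-id>}"
DRY_RUN="${DRY_RUN:-1}"                                        # 1 = print only, 0 = run

# Print a command; execute it only when DRY_RUN=0.
run() {
  printf '+ %s\n' "$*"
  if [ "$DRY_RUN" = "0" ]; then
    "$@"
  fi
}

# Build the Azure resource ID of a container.
# $1 = subscription ID, $2 = resource group, $3 = storage account, $4 = container
container_scope() {
  printf '/subscriptions/%s/resourceGroups/%s/providers/Microsoft.Storage/storageAccounts/%s/blobServices/default/containers/%s\n' \
    "$1" "$2" "$3" "$4"
}

# $1 = principal (object) ID of the managed identity, $2 = scope
assign_owner() {
  run az role assignment create \
    --assignee-object-id "$1" \
    --assignee-principal-type ServicePrincipal \
    --role "Storage Blob Data Owner" \
    --scope "$2"
}

# data-eng-mi gets access to dataeng only; data-science-mi to datascience only.
assign_owner "$DATA_ENG_PRINCIPAL_ID" \
  "$(container_scope "$SUBSCRIPTION_ID" "$RESOURCE_GROUP" "$STORAGE_ACCOUNT" dataeng)"
assign_owner "$DATA_SCIENCE_PRINCIPAL_ID" \
  "$(container_scope "$SUBSCRIPTION_ID" "$RESOURCE_GROUP" "$STORAGE_ACCOUNT" datascience)"
```

Scoping each assignment to a single container, rather than to the whole storage account, is what keeps the two groups' data separate.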

Adding CDP user/group to managed identity mappings

After creating the two additional managed identities, one for data engineers (data-eng-mi) and one for data scientists (data-science-mi), map them to specific users or groups in CDP.

Required role: DataSteward, EnvironmentAdmin, or Owner

Steps

  1. In the Management Console, navigate to Environments > your environment > Actions > Manage Access > IDBroker Mappings.
  2. Under Current Mappings, click Edit.
  3. Click + to display a new field for adding a mapping.
  4. Provide the following:
    1. The User or Group dropdown is pre-populated with CDP users and groups. Select the user or group that you would like to map.
    2. Under Role, specify the resource ID of a managed identity (copied from Azure Portal). You should select your data-eng-mi here.
  5. Repeat the previous two steps to add a mapping for data-science-mi.
  6. Click Save and Sync.

If you would like to create the mappings via CDP CLI, you can:

  1. Use the cdp environments get-id-broker-mappings command to obtain your current mappings.
  2. Use the cdp environments set-id-broker-mappings command to set additional mappings. The only way to use this command is to:
    • Pass all the current mappings
    • Add the new mappings
  3. Next, sync IDBroker mappings. For example:
    cdp environments sync-id-broker-mappings --environment-name demo3
  4. Finally, check the sync status. For example:
    cdp environments get-id-broker-mappings-sync-status --environment-name demo3
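
The CLI steps above can be sketched end to end as follows. Treat this as an assumption-laden illustration. The flag names for set-id-broker-mappings (--data-access-role, --ranger-audit-role, and --mappings with accessorCrn=...,role=... items) reflect the CLI help as I understand it and should be confirmed with cdp environments set-id-broker-mappings help for your CLI version. All CRNs and resource IDs are placeholders, and mapping_arg and run are hypothetical helpers. By default the script only prints the commands.

```shell
# Sketch only: add IDBroker mappings through the CDP CLI.
ENV_NAME="${ENV_NAME:-demo3}"                                   # environment name from the examples above
DATA_ACCESS_ROLE="${DATA_ACCESS_ROLE:-<data-access-identity-resource-id>}"
RANGER_AUDIT_ROLE="${RANGER_AUDIT_ROLE:-<ranger-audit-identity-resource-id>}"
DATA_ENG_CRN="${DATA_ENG_CRN:-<data-engineers-group-crn>}"
DATA_ENG_MI="${DATA_ENG_MI:-<data-eng-mi-resource-id>}"
DATA_SCIENCE_CRN="${DATA_SCIENCE_CRN:-<data-scientists-group-crn>}"
DATA_SCIENCE_MI="${DATA_SCIENCE_MI:-<data-science-mi-resource-id>}"
DRY_RUN="${DRY_RUN:-1}"                                         # 1 = print only, 0 = run

# Print a command; execute it only when DRY_RUN=0.
run() {
  printf '+ %s\n' "$*"
  if [ "$DRY_RUN" = "0" ]; then
    "$@"
  fi
}

# Format one mapping item. $1 = user or group CRN, $2 = managed identity resource ID
mapping_arg() {
  printf 'accessorCrn=%s,role=%s\n' "$1" "$2"
}

# 1. Inspect the current mappings.
run cdp environments get-id-broker-mappings --environment-name "$ENV_NAME"

# 2. Set the mappings. Per the steps above, every mapping that already exists
#    must be passed again here together with the new ones.
run cdp environments set-id-broker-mappings \
  --environment-name "$ENV_NAME" \
  --data-access-role "$DATA_ACCESS_ROLE" \
  --ranger-audit-role "$RANGER_AUDIT_ROLE" \
  --mappings \
    "$(mapping_arg "$DATA_ENG_CRN" "$DATA_ENG_MI")" \
    "$(mapping_arg "$DATA_SCIENCE_CRN" "$DATA_SCIENCE_MI")"

# 3. Sync the mappings, then 4. check the sync status.
run cdp environments sync-id-broker-mappings --environment-name "$ENV_NAME"
run cdp environments get-id-broker-mappings-sync-status --environment-name "$ENV_NAME"
```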

Updating access for the Data Lake Admin managed identity

Grant the Data Lake Admin managed identity created earlier the Storage Blob Data Owner role on the scope of the two newly created containers.

Perform the following steps for both the dataeng and datascience containers to grant the Data Lake Admin managed identity access to them:

  1. Navigate to Storage accounts > your storage account > Containers > your container > Access Control (IAM).
  2. Click +Add > Add role assignment.
  3. Under Add role assignment:
    1. Under Role, select Storage Blob Data Owner.
    2. Under Assign access to, select User assigned managed identity.
    3. Under Select, select the Data Lake Admin managed identity created earlier.
    4. Click Save.

Repeat these steps to give the Data Lake Admin managed identity access to both containers.
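
Because the same role is granted on two scopes, this step lends itself to a loop. The sketch below uses Azure CLI under the same assumptions as the portal steps. The subscription ID, resource group, and the Data Lake Admin principal ID are placeholders, and grant_admin_on_containers, container_scope, and run are hypothetical helper names. Commands are only printed unless DRY_RUN=0 is set.

```shell
# Sketch only: grant the Data Lake Admin identity access to both new containers.
SUBSCRIPTION_ID="${SUBSCRIPTION_ID:-<subscription-id>}"         # placeholder
RESOURCE_GROUP="${RESOURCE_GROUP:-my-cdp-rg}"                   # hypothetical name
STORAGE_ACCOUNT="${STORAGE_ACCOUNT:-mydatalake}"                # example name from this document
ADMIN_PRINCIPAL_ID="${ADMIN_PRINCIPAL_ID:-<datalake-admin-principal-id>}"
DRY_RUN="${DRY_RUN:-1}"                                         # 1 = print only, 0 = run

# Print a command; execute it only when DRY_RUN=0.
run() {
  printf '+ %s\n' "$*"
  if [ "$DRY_RUN" = "0" ]; then
    "$@"
  fi
}

# Build the Azure resource ID of a container in the configured storage account.
container_scope() {
  printf '/subscriptions/%s/resourceGroups/%s/providers/Microsoft.Storage/storageAccounts/%s/blobServices/default/containers/%s\n' \
    "$SUBSCRIPTION_ID" "$RESOURCE_GROUP" "$STORAGE_ACCOUNT" "$1"
}

# Assign Storage Blob Data Owner to the admin identity on each container given.
grant_admin_on_containers() {
  for container in "$@"; do
    run az role assignment create \
      --assignee-object-id "$ADMIN_PRINCIPAL_ID" \
      --assignee-principal-type ServicePrincipal \
      --role "Storage Blob Data Owner" \
      --scope "$(container_scope "$container")"
  done
}

grant_admin_on_containers dataeng datascience
```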