Onboarding CDP users and groups for cloud storage

The minimal setup defined earlier spins up a CDP environment and Data Lake with no end-user access to cloud storage. To give users and groups that you add to a CDP cluster access to cloud storage, you must map them to IAM roles.

In general, each user or group to be onboarded needs the following IAM resources pre-created in AWS:

  • One IAM role for the user/group
  • One IAM policy for the user/group role to access the required S3 bucket(s) and path(s)

The example below adds a data engineering group and a data science group to the cluster. Building on the minimal setup, the goal is to end up with the following roles:

| Role | Permissions policy | Trust policy | Description |
|------|--------------------|--------------|-------------|
| DATAENG_ROLE | dataeng-policy-s3access, bucket-policy-s3access, dynamodb-policy | idbroker-role-trust-policy | This role uses the three permissions policies to provide data engineers with access to a specific S3 location (s3://my-bucket/my-dl/dataeng). The trust policy allows the role to be assumed by IDBroker. |
| DATASCI_ROLE | datasci-policy-s3access, bucket-policy-s3access, dynamodb-policy | idbroker-role-trust-policy | This role uses the three permissions policies to provide data scientists with access to a specific S3 location (s3://my-bucket/my-dl/datasci). The trust policy allows the role to be assumed by IDBroker. |

IAM policy definitions

Use the following definitions to create the IAM policies.

Note that:

  • The policy definitions refer to roles by using the convention presented in the table above. If the IAM roles that you created use different names, you should update these names in the policy definitions below.

  • The policy definitions refer to the example S3 subdirectories presented above. If the S3 bucket subdirectories that you created use different names, you should update these names in the policy definitions below.

While creating these IAM policies, make sure to replace the following with actual values:

  • ${AWS_ACCOUNT_ID} - Your AWS account ID.

  • ${DATALAKE_PATH} - Path to your Data Lake directory, for example, my-bucket/my-dl. The directory does not have to be under the Storage Location Base, but for simplicity this example assumes that it is a subdirectory of the Storage Location Base.

dataeng-policy-s3access

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor3",
            "Effect": "Allow",
            "Action": "s3:*",
            "Resource": "arn:aws:s3:::${DATALAKE_PATH}/dataeng/*"
        }
    ]
}

datasci-policy-s3access

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor3",
            "Effect": "Allow",
            "Action": "s3:*",
            "Resource": "arn:aws:s3:::${DATALAKE_PATH}/datasci/*"
        }
    ]
}
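
The placeholder substitution described above can also be scripted. The following is a minimal sketch that uses Python's standard string.Template class, which happens to use the same ${NAME} placeholder syntax as these definitions. The function name render_policy is an illustrative assumption and is not part of any CDP or AWS tooling; the sample path is the my-bucket/my-dl example used in this document.

```python
import json
from string import Template

# Template text copied from the dataeng-policy-s3access definition above.
DATAENG_POLICY_TEMPLATE = """
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor3",
            "Effect": "Allow",
            "Action": "s3:*",
            "Resource": "arn:aws:s3:::${DATALAKE_PATH}/dataeng/*"
        }
    ]
}
"""


def render_policy(template_text: str, **values: str) -> dict:
    """Fill ${NAME} placeholders and return the parsed policy document.

    Template.substitute raises KeyError when a placeholder has no value,
    and json.loads raises ValueError when the result is not valid JSON,
    so a half-filled policy is never returned.
    """
    rendered = Template(template_text).substitute(**values)
    return json.loads(rendered)


if __name__ == "__main__":
    policy = render_policy(DATAENG_POLICY_TEMPLATE, DATALAKE_PATH="my-bucket/my-dl")
    print(json.dumps(policy, indent=4))
```

Parsing the rendered text before using it is a deliberate choice in this sketch: a missing value or a malformed edit is caught locally instead of being rejected later by AWS.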

Creating IAM resources

You can create IAM roles and policies from the IAM console on AWS or from the AWS CLI.
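
If you prefer to script these steps, the same sequence can be sketched with the AWS SDK for Python (boto3). This is an alternative that this document does not otherwise cover. The helper name create_group_role is hypothetical. The trust policy document and the ARNs of the shared policies (bucket-policy-s3access and dynamodb-policy) are assumed to already exist from the minimal setup and are passed in rather than defined here.

```python
import json


def create_group_role(iam, role_name, policy_name, policy_doc, trust_doc,
                      shared_policy_arns=()):
    """Create one IAM policy and one IAM role, then attach policies to the role.

    `iam` is expected to be a boto3 IAM client, i.e. boto3.client("iam").
    `policy_doc` and `trust_doc` are policy documents as Python dicts.
    Returns the ARN of the newly created role.
    """
    # 1. Create the group-specific S3 access policy.
    policy = iam.create_policy(
        PolicyName=policy_name,
        PolicyDocument=json.dumps(policy_doc),
    )
    # 2. Create the role; the trust policy controls who may assume it.
    role = iam.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(trust_doc),
    )
    # 3. Attach the new policy and any pre-existing shared policies.
    for arn in [policy["Policy"]["Arn"], *shared_policy_arns]:
        iam.attach_role_policy(RoleName=role_name, PolicyArn=arn)
    return role["Role"]["Arn"]
```

With boto3 installed and AWS credentials configured, a call might look like create_group_role(boto3.client("iam"), "DATAENG_ROLE", "dataeng-policy-s3access", policy_doc, trust_doc, shared_arns). The sketch does no error handling, so the underlying AWS error is raised if, for example, a policy or role with the same name already exists.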

Adding CDP user/group to IAM role mappings

To use the IAM roles created above in CDP, you must map the users and groups to the IAM roles. To add or modify these mappings, open the Management Console and navigate to Environments > your environment > Actions > Manage Access > IDBroker Mappings > Edit.

Under IDBroker Mappings, you can change the mappings of users or groups to IAM roles. The user or group dropdown is prepopulated with CDP users and groups. On the right-hand side, specify the role ARN (copied from the IAM role page) for the user or group that you are configuring.

The example setup created the following roles:

  • DATAENG_ROLE - Created while onboarding users. We assume that a DataEngineers group exists in CDP.

  • DATASCI_ROLE - Created while onboarding users. We assume that a DataScientists group exists in CDP.

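Based on the roles and groups created in this example, the mappings that need to be created are:

  • DataEngineers group - arn:aws:iam::${AWS_ACCOUNT_ID}:role/DATAENG_ROLE

  • DataScientists group - arn:aws:iam::${AWS_ACCOUNT_ID}:role/DATASCI_ROLE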

To create the mappings via the CDP CLI instead, use the cdp environments set-id-broker-mappings command.
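
As a rough illustration of the CLI route, the following Python sketch assembles the argument list for that command without running it. The flag names (--environment-name, --data-access-role, --ranger-audit-role, --mappings with accessorCrn= and role= pairs) reflect the CDP CLI reference as best understood here and should be verified with cdp environments set-id-broker-mappings help before use. The helper names, the environment name, the account ID, and all angle-bracket values are placeholders, not real identifiers.

```python
import shlex


def role_arn(account_id: str, role_name: str) -> str:
    """Build an IAM role ARN of the form arn:aws:iam::<account>:role/<name>."""
    return f"arn:aws:iam::{account_id}:role/{role_name}"


def build_mapping_command(environment_name, data_access_role,
                          ranger_audit_role, mappings):
    """Return the argv list for `cdp environments set-id-broker-mappings`.

    `mappings` maps a CDP user or group CRN to an IAM role ARN.
    Flag names are assumptions to check against the CLI help output.
    """
    cmd = [
        "cdp", "environments", "set-id-broker-mappings",
        "--environment-name", environment_name,
        "--data-access-role", data_access_role,
        "--ranger-audit-role", ranger_audit_role,
        "--mappings",
    ]
    # One accessorCrn=...,role=... argument per user or group.
    cmd += [f"accessorCrn={crn},role={arn}" for crn, arn in mappings.items()]
    return cmd


if __name__ == "__main__":
    account = "123456789012"  # placeholder AWS account ID
    cmd = build_mapping_command(
        "my-environment",               # hypothetical environment name
        "<data-access-role-arn>",       # from the minimal setup
        "<ranger-audit-role-arn>",      # from the minimal setup
        {
            "<DataEngineers-group-CRN>": role_arn(account, "DATAENG_ROLE"),
            "<DataScientists-group-CRN>": role_arn(account, "DATASCI_ROLE"),
        },
    )
    # Print a shell-quoted command line for review instead of executing it.
    print(shlex.join(cmd))
```

Printing the command instead of executing it lets you review the full set of mappings first, which matters if the command replaces the existing mappings for the environment rather than adding to them.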