Minimal setup for cloud storage

This minimal secure setup uses one S3 bucket for each Data Lake, and multiple IAM roles and policies.

You may choose a different setup, for example one using multiple buckets. Similarly, the IAM roles, policies, and their names are only examples. A setup with fewer roles and policies that grant broader access rights is possible; however, such a setup may not be secure enough for a production environment.

The example setup includes:

  • One S3 bucket with a sub-directory named after your Data Lake, such as s3a://my-bucket/my-dl.
  • Four IAM roles:
    • IDBROKER_ROLE
    • LOG_ROLE
    • RANGER_AUDIT_ROLE
    • DATALAKE_ADMIN_ROLE
  • Eight IAM policies:
    • One AssumeRole policy (aws-cdp-idbroker-assume-role-policy) that the IDBroker component of the Data Lake cluster can use to assume each of the following roles:
      • RANGER_AUDIT_ROLE
      • DATALAKE_ADMIN_ROLE
    • Two trust policies:
      • aws-cdp-ec2-role-trust-policy
      • aws-cdp-idbroker-role-trust-policy
    • Two shared policies for accessing S3 and DynamoDB:
      • aws-cdp-bucket-access-policy
      • aws-cdp-dynamodb-policy
    • Three policies for specific bucket directory access:
      • Log storage (aws-cdp-log-policy)
      • Ranger audit (aws-cdp-ranger-audit-s3-policy)
      • Data Lake admin (aws-cdp-datalake-admin-s3-policy)

The following sections describe the required setup in more detail.

Required cloud storage

One S3 bucket with a sub-directory named after your Data Lake, such as s3a://my-bucket/my-dl, is required for this setup.

During Data Lake creation, CDP automatically creates locations for Ranger audits, for Data Lake, Data Hub, and FreeIPA logs, and for FreeIPA backups. Each location depends on the Storage Location Base and the Logs Location Base that you supply. CDP creates the directory structure within these base directories, as shown in the following examples:

Storage Location Base examples

  Storage Location Base   s3a://my-bucket/                          s3a://my-bucket/my-dl
  Ranger Audit Logs       s3a://my-bucket/ranger/audit              s3a://my-bucket/my-dl/ranger/audit

Logs Location Base examples

  Logs Location Base      s3a://my-bucket/                          s3a://my-bucket/my-dl
  FreeIPA Logs            s3a://my-bucket/cluster-logs/freeipa      s3a://my-bucket/my-dl/cluster-logs/freeipa
  FreeIPA Backup          s3a://my-bucket/cluster-backups/freeipa   s3a://my-bucket/my-dl/cluster-backups/freeipa

If your environment was created prior to February 2021, the FreeIPA Logs location in the second example is s3a://my-bucket/my-dl/freeipa.
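
The way the example locations follow from the two base locations can be sketched in a few lines of Python. This is only an illustration of the examples above, not CDP's actual implementation; the names derive_locations and CdpLocations are hypothetical, and only the three locations listed in the examples are covered.

```python
from dataclasses import dataclass


def _join(base: str, suffix: str) -> str:
    """Join a base location and a relative suffix with exactly one slash."""
    return base.rstrip("/") + "/" + suffix


@dataclass(frozen=True)
class CdpLocations:
    """Hypothetical container for the locations shown in the examples."""
    ranger_audit: str
    freeipa_logs: str
    freeipa_backup: str


def derive_locations(storage_base: str, logs_base: str) -> CdpLocations:
    """Mirror the example layout for environments created from February 2021 on."""
    return CdpLocations(
        ranger_audit=_join(storage_base, "ranger/audit"),
        freeipa_logs=_join(logs_base, "cluster-logs/freeipa"),
        freeipa_backup=_join(logs_base, "cluster-backups/freeipa"),
    )


if __name__ == "__main__":
    locations = derive_locations("s3a://my-bucket/my-dl", "s3a://my-bucket/my-dl")
    print(locations.ranger_audit)
    print(locations.freeipa_logs)
    print(locations.freeipa_backup)
```

Stripping the trailing slash before joining keeps a base such as s3a://my-bucket/ from producing a double slash in the derived paths.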

Required IAM resources

The following diagram summarizes the roles, policies, and S3 bucket directories in this example setup:

The following table lists the IAM roles and IAM policies that need to be created on AWS and shows which policies should be assigned to which roles. As presented in the diagram, some policies are assigned to multiple roles. The policy definitions are provided in a separate section below the table:

IDBROKER_ROLE
  • Permissions policies: aws-cdp-idbroker-assume-role-policy, aws-cdp-log-policy
  • Trust policy: aws-cdp-ec2-role-trust-policy
  • Description: The assume role permissions policy must, at a minimum, allow the IDBROKER_ROLE to assume the RANGER_AUDIT_ROLE and the DATALAKE_ADMIN_ROLE. This policy must also allow the IDBROKER_ROLE to assume any other role for which a user or group mapping exists in the IDBroker. Furthermore, the IDBROKER_ROLE needs the same permissions policy as the LOG_ROLE so that it can access the Logs Location Base. The trust policy allows the role to be assumed by the IDBroker EC2 instance.

LOG_ROLE
  • Permissions policy: aws-cdp-log-policy
  • Trust policy: aws-cdp-ec2-role-trust-policy
  • Description: This role uses its permissions policy to provide CDP with access to the Logs Location Base, the location used for logs. The trust policy allows the role to be assumed by EC2 instances in the cluster.

RANGER_AUDIT_ROLE
  • Permissions policies: aws-cdp-ranger-audit-s3-policy, aws-cdp-bucket-access-policy, aws-cdp-dynamodb-policy
  • Trust policy: aws-cdp-idbroker-role-trust-policy
  • Description: This role uses the three permissions policies to provide write access to the Ranger audit sub-directory that CDP creates within the Storage Location Base. The trust policy allows the role to be assumed by IDBroker.

DATALAKE_ADMIN_ROLE
  • Permissions policies: aws-cdp-datalake-admin-s3-policy, aws-cdp-bucket-access-policy, aws-cdp-dynamodb-policy
  • Trust policy: aws-cdp-idbroker-role-trust-policy
  • Description: This role uses the three permissions policies to provide the Data Lake admin with full access to the whole Storage Location Base. The trust policy allows the role to be assumed by IDBroker.
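
The role-to-policy assignments above can be captured as plain data, which makes it easy to check the setup against the counts given earlier (four roles, eight policies) before creating anything on AWS. The sketch below is only an illustration; the names ROLE_POLICIES, all_policies, and roles_using are hypothetical and not part of CDP or AWS.

```python
# Role -> policies, transcribed from the table above.
ROLE_POLICIES = {
    "IDBROKER_ROLE": {
        "permissions": ["aws-cdp-idbroker-assume-role-policy", "aws-cdp-log-policy"],
        "trust": "aws-cdp-ec2-role-trust-policy",
    },
    "LOG_ROLE": {
        "permissions": ["aws-cdp-log-policy"],
        "trust": "aws-cdp-ec2-role-trust-policy",
    },
    "RANGER_AUDIT_ROLE": {
        "permissions": [
            "aws-cdp-ranger-audit-s3-policy",
            "aws-cdp-bucket-access-policy",
            "aws-cdp-dynamodb-policy",
        ],
        "trust": "aws-cdp-idbroker-role-trust-policy",
    },
    "DATALAKE_ADMIN_ROLE": {
        "permissions": [
            "aws-cdp-datalake-admin-s3-policy",
            "aws-cdp-bucket-access-policy",
            "aws-cdp-dynamodb-policy",
        ],
        "trust": "aws-cdp-idbroker-role-trust-policy",
    },
}


def all_policies(mapping):
    """Return the set of distinct policy names (permissions and trust)."""
    names = set()
    for entry in mapping.values():
        names.update(entry["permissions"])
        names.add(entry["trust"])
    return names


def roles_using(mapping, policy):
    """Return the sorted list of roles that a given policy is assigned to."""
    return sorted(
        role
        for role, entry in mapping.items()
        if policy in entry["permissions"] or policy == entry["trust"]
    )


if __name__ == "__main__":
    print(len(ROLE_POLICIES), "roles,", len(all_policies(ROLE_POLICIES)), "policies")
    print(roles_using(ROLE_POLICIES, "aws-cdp-log-policy"))
```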

Creating an S3 bucket

You can create the S3 bucket from AWS CLI or from the S3 console on AWS. For instructions on how to create an S3 bucket, refer to Create a bucket.
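
If you prefer to script the bucket creation, a third option not covered above is the AWS SDK for Python (boto3). The following is a minimal sketch under that assumption; the function names, the bucket name, and the region are placeholders, and boto3 must be installed and configured with credentials that are allowed to create buckets. Creating the Data Lake prefix up front is optional and is shown only to make the sub-directory visible in the S3 console.

```python
def create_bucket_kwargs(bucket: str, region: str) -> dict:
    """Build the arguments for the S3 CreateBucket call.

    us-east-1 is the default location and must not be passed as a
    LocationConstraint; every other region has to be stated explicitly.
    """
    kwargs = {"Bucket": bucket}
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


def create_bucket_with_prefix(bucket: str, region: str, data_lake_name: str) -> None:
    """Create the bucket and an empty 'directory' marker for the Data Lake."""
    import boto3  # imported here so the helper above works without boto3

    s3 = boto3.client("s3", region_name=region)
    s3.create_bucket(**create_bucket_kwargs(bucket, region))
    # S3 has no real directories; a zero-byte object whose key ends in "/"
    # is the usual way to make a prefix show up as a folder.
    s3.put_object(Bucket=bucket, Key=f"{data_lake_name}/")


if __name__ == "__main__":
    # Placeholder values; bucket names must be globally unique.
    create_bucket_with_prefix("my-bucket", "us-west-2", "my-dl")
```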

Creating IAM roles and policies

You can create the IAM roles and policies from AWS CLI or from the IAM console on AWS.

  • For IAM policy definitions, refer to IAM policy definitions.
  • For instructions on how to create IAM policies, refer to Creating IAM policies in AWS documentation.
  • For instructions on how to create IAM roles, refer to Creating IAM roles in AWS documentation.
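
To give a feel for the general shape of the assume-role and trust policies used in this setup, the sketch below builds three JSON policy documents in Python. These are not the official CDP policy definitions (use the ones in IAM policy definitions); the function names and the account ID 123456789012 are placeholders chosen for illustration.

```python
import json

POLICY_VERSION = "2012-10-17"  # current IAM policy language version


def assume_role_policy(role_arns):
    """Permissions policy letting its holder assume the listed roles."""
    return {
        "Version": POLICY_VERSION,
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["sts:AssumeRole"],
                "Resource": list(role_arns),
            }
        ],
    }


def ec2_trust_policy():
    """Trust policy letting EC2 instances assume the role."""
    return {
        "Version": POLICY_VERSION,
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "ec2.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def idbroker_trust_policy(idbroker_role_arn):
    """Trust policy letting the IDBroker role assume the role."""
    return {
        "Version": POLICY_VERSION,
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": idbroker_role_arn},
                "Action": "sts:AssumeRole",
            }
        ],
    }


if __name__ == "__main__":
    account = "123456789012"  # placeholder account ID
    targets = [
        f"arn:aws:iam::{account}:role/RANGER_AUDIT_ROLE",
        f"arn:aws:iam::{account}:role/DATALAKE_ADMIN_ROLE",
    ]
    print(json.dumps(assume_role_policy(targets), indent=2))
    print(json.dumps(ec2_trust_policy(), indent=2))
    print(json.dumps(idbroker_trust_policy(f"arn:aws:iam::{account}:role/IDBROKER_ROLE"), indent=2))
```

The printed JSON can be saved to a file and supplied when creating the policy or role in the IAM console or with the AWS CLI.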

Once these resources have been created, provide the roles and storage locations to CDP as described below.

Providing the parameters in the UI

Once you’ve created the bucket and the required instance profiles, provide the information related to these resources in the Register Environment wizard as follows:

Data Access and Audit

Assumer Instance Profile
  • Description: Select the IDBroker instance profile created earlier.
  • Example: IDBROKER_ROLE

Storage Location Base
  • Description: Enter the Storage Location Base S3 bucket location created earlier.
  • Example: my-bucket/my-dl

Data Access Role
  • Description: Select the DATALAKE_ADMIN_ROLE created earlier.
  • Example: DATALAKE_ADMIN_ROLE

Ranger Audit Role
  • Description: Select the RANGER_AUDIT_ROLE created earlier.
  • Example: RANGER_AUDIT_ROLE

Logs

Logger Instance Profile
  • Description: Select the LOG_ROLE instance profile created earlier.
  • Example: LOG_ROLE

Logs Location Base
  • Description: Enter the Logs Location Base S3 bucket location created earlier.
  • Example: my-bucket/my-dl
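
Because the location values are typed in by hand, a quick sanity check before entering them can catch slips such as a stray hyphen or a doubled slash. The sketch below is a hypothetical helper, not part of CDP; it checks only a subset of the S3 bucket naming rules (3 to 63 characters; lowercase letters, digits, dots, and hyphens; starting and ending with a letter or digit).

```python
import re

# Subset of the S3 bucket naming rules: 3-63 characters, lowercase letters,
# digits, dots and hyphens, starting and ending with a letter or digit.
_BUCKET_RE = re.compile(r"^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$")


def split_location(location: str):
    """Split 'bucket/path' or 's3a://bucket/path' into (bucket, path)."""
    for scheme in ("s3a://", "s3://"):
        if location.startswith(scheme):
            location = location[len(scheme):]
            break
    bucket, _, path = location.partition("/")
    return bucket, path.strip("/")


def check_location(location: str) -> list:
    """Return a list of human-readable problems; empty if none were found."""
    problems = []
    bucket, path = split_location(location)
    if not _BUCKET_RE.match(bucket):
        problems.append(f"'{bucket}' is not a valid S3 bucket name")
    if "//" in path:
        problems.append("path contains an empty segment ('//')")
    return problems


if __name__ == "__main__":
    for value in ("my-bucket/my-dl", "my-bucket-/my-dl", "my-bucket/my-dl//logs"):
        print(value, "->", check_location(value) or "ok")
```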