Minimal setup for cloud storage

This minimal secure setup uses one S3 bucket for each Data Lake and multiple IAM roles and policies.

You may choose a different setup, for example one that uses multiple buckets. Likewise, the IAM roles, policies, and their names shown here are only examples. A setup with fewer roles and policies that grant broader access rights is possible; however, such a setup may not be secure enough for a production environment.

The example setup includes:

  • S3 bucket - One S3 bucket with a sub-directory named after your Data Lake, such as s3://my-bucket/my-dl. During Data Lake creation, CDP automatically creates a location for log storage and a location for Ranger audits:
    • The location for log storage will be created based on what you specify during environment creation as Logs Location Base. In this example, our Logs Location Base is s3://my-bucket/my-dl/logs
    • The location for Ranger audits will be based on what you specify during environment creation as Storage Location Base and the suffix /ranger/audit. The directory structure will be created automatically by CDP within the Storage Location Base directory. In this example, the Storage Location Base is s3://my-bucket/my-dl and therefore the Ranger audits location created automatically by CDP is s3://my-bucket/my-dl/ranger/audit
    • In our example, the Logs Location Base and Storage Location Base happen to be in the same bucket, but this is not required. If you use multiple buckets, the bucket-policy-s3access policy must also list the additional bucket.
  • Four IAM roles:

    • IDBROKER_ROLE
    • LOG_ROLE
    • RANGER_AUDIT_ROLE
    • DATALAKE_ADMIN_ROLE
  • Eight IAM policies:

    • One AssumeRole policy (idbroker-assume-role-policy) that can be used by the IDBroker component of the Data Lake cluster to assume each of the following roles:

      • RANGER_AUDIT_ROLE
      • DATALAKE_ADMIN_ROLE
    • Two trust policies
      • ec2-role-trust-policy
      • idbroker-role-trust-policy
    • Two shared policies for accessing S3 and DynamoDB:
      • bucket-policy-s3access
      • dynamodb-policy
    • Three policies for specific bucket directory access:
      • Log storage (log-policy-s3access)
      • Ranger audit (ranger-audit-policy-s3access)
      • Data Lake admin (datalake-admin-policy-s3access)
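
The way the storage locations in this list relate to each other can be sketched in a few lines of Python. This is only an illustration built on the example values from this page. The helper names ranger_audit_location and bucket_of are hypothetical and are not part of CDP or of any AWS tool.

```python
# Example values taken from the text above.
STORAGE_LOCATION_BASE = "s3://my-bucket/my-dl"
LOGS_LOCATION_BASE = "s3://my-bucket/my-dl/logs"


def ranger_audit_location(storage_location_base: str) -> str:
    """Append the /ranger/audit suffix to the Storage Location Base."""
    return storage_location_base.rstrip("/") + "/ranger/audit"


def bucket_of(s3_url: str) -> str:
    """Return the bucket name of an s3:// URL."""
    prefix = "s3://"
    if not s3_url.startswith(prefix):
        raise ValueError(f"not an s3:// URL: {s3_url}")
    return s3_url[len(prefix):].split("/", 1)[0]


audit = ranger_audit_location(STORAGE_LOCATION_BASE)
same_bucket = bucket_of(STORAGE_LOCATION_BASE) == bucket_of(LOGS_LOCATION_BASE)

print(audit)        # s3://my-bucket/my-dl/ranger/audit
print(same_bucket)  # True
```

If same_bucket were False, the setup would be the multi-bucket case mentioned above, in which bucket-policy-s3access has to cover both buckets.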

The following diagram summarizes the roles, policies, and S3 bucket directories in this example setup:

The following table lists and describes the IAM roles and IAM policies that need to be created on AWS, and shows which policies should be assigned to which roles. As presented in the diagram, some policies are assigned to multiple roles. The policy definitions are provided in a separate section below the table.

| Role | Permissions policy | Trust policy | Description |
|------|--------------------|--------------|-------------|
| IDBROKER_ROLE | idbroker-assume-role-policy | ec2-role-trust-policy | The permissions policy must, at a minimum, allow the IDBROKER_ROLE to assume the RANGER_AUDIT_ROLE and the DATALAKE_ADMIN_ROLE. In addition, this policy must allow the IDBROKER_ROLE to assume any other role for which a user or group mapping exists in IDBroker. The trust policy allows the role to be assumed by the IDBroker EC2 instance. |
| LOG_ROLE | log-policy-s3access, bucket-policy-s3access | ec2-role-trust-policy | This role uses the two permissions policies to provide CDP with access to the specific location called Logs Location Base for logs (s3://my-bucket/my-dl/logs). The trust policy allows the role to be assumed by EC2 instances in the cluster. |
| RANGER_AUDIT_ROLE | ranger-audit-policy-s3access, bucket-policy-s3access, dynamodb-policy | idbroker-role-trust-policy | This role uses the three permissions policies to provide write access to the Ranger audit sub-directory that CDP creates within the Storage Location Base (s3://my-bucket/my-dl/ranger/audit). The trust policy allows the role to be assumed by IDBroker. |
| DATALAKE_ADMIN_ROLE | datalake-admin-policy-s3access, bucket-policy-s3access, dynamodb-policy | idbroker-role-trust-policy | This role uses the three permissions policies to provide the Data Lake admin with full access to the whole Storage Location Base (s3://my-bucket/my-dl). The trust policy allows the role to be assumed by IDBroker. |
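
The role-to-policy assignments in the table can also be written down as a small data structure, which makes it easy to check which policies are shared and how many distinct policies the setup needs. The following Python sketch only restates the table. The ROLE_POLICIES name and its layout are assumptions made for this illustration.

```python
from collections import Counter

# Role -> permissions policies and trust policy, as listed in the table above.
ROLE_POLICIES = {
    "IDBROKER_ROLE": {
        "permissions": ["idbroker-assume-role-policy"],
        "trust": "ec2-role-trust-policy",
    },
    "LOG_ROLE": {
        "permissions": ["log-policy-s3access", "bucket-policy-s3access"],
        "trust": "ec2-role-trust-policy",
    },
    "RANGER_AUDIT_ROLE": {
        "permissions": [
            "ranger-audit-policy-s3access",
            "bucket-policy-s3access",
            "dynamodb-policy",
        ],
        "trust": "idbroker-role-trust-policy",
    },
    "DATALAKE_ADMIN_ROLE": {
        "permissions": [
            "datalake-admin-policy-s3access",
            "bucket-policy-s3access",
            "dynamodb-policy",
        ],
        "trust": "idbroker-role-trust-policy",
    },
}

# Count how many roles use each permissions policy.
usage = Counter(p for role in ROLE_POLICIES.values() for p in role["permissions"])
shared = sorted(p for p, count in usage.items() if count > 1)

# Every distinct policy, permissions and trust policies together.
all_policies = set(usage) | {role["trust"] for role in ROLE_POLICIES.values()}

print(shared)             # ['bucket-policy-s3access', 'dynamodb-policy']
print(len(all_policies))  # 8
```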

IAM policy definitions

Use the following definitions to create the IAM policies.

Note that:

  • The policy definitions refer to roles by using the convention presented in the table above. If the IAM roles that you created use different names, you should update these names in the policy definitions below.
  • The policy definitions refer to the example S3 bucket sub-directories presented above. If the S3 bucket sub-directories that you created use different names, you should update these names in the policy definitions below.

While creating these IAM policies, make sure to replace the following with actual values:

  • ${AWS_ACCOUNT_ID} - Your AWS account ID
  • ${DATALAKE_BUCKET} - Your S3 bucket. For example my-bucket
  • ${STORAGE_LOCATION_BASE} - Path to your Data Lake directory in the S3 bucket specified as ${DATALAKE_BUCKET}/SOME_PATH. For example my-bucket/my-dl
  • ${LOGS_LOCATION_BASE} - Path to your S3 location for logs. For example my-bucket/my-dl/logs
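  • ${DYNAMODB_TABLE_NAME} - The name of your DynamoDB table used for S3Guard. This should correspond to your DynamoDB Table Name provided under Enable S3Guard during environment creation.
  • ${IDBROKER_ROLE} - The name of the IAM role used by IDBroker (IDBROKER_ROLE in this example). This placeholder is used in idbroker-role-trust-policy.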
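
The placeholders use the ${NAME} form, which happens to match the syntax of Python's standard string.Template class, so the replacement can be scripted. The following is a minimal sketch, not an official tool. The account ID and DynamoDB table name are made-up values, the render_policy helper is hypothetical, and the template is a copy of the log-policy-s3access definition from this page.

```python
import json
from string import Template

# Example values from this page. The account ID and the table name are
# made-up and must be replaced with your own.
VALUES = {
    "AWS_ACCOUNT_ID": "123456789012",
    "DATALAKE_BUCKET": "my-bucket",
    "STORAGE_LOCATION_BASE": "my-bucket/my-dl",
    "LOGS_LOCATION_BASE": "my-bucket/my-dl/logs",
    "DYNAMODB_TABLE_NAME": "my-s3guard-table",
}

LOG_POLICY_TEMPLATE = """{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:ListMultipartUploadParts",
                "s3:AbortMultipartUpload"
            ],
            "Resource": "arn:aws:s3:::${LOGS_LOCATION_BASE}/*"
        }
    ]
}"""


def render_policy(template_text: str, values: dict) -> dict:
    """Fill in the ${...} placeholders and parse the result as JSON.

    Template.substitute raises KeyError when a placeholder has no value,
    and json.loads raises ValueError when the result is not valid JSON.
    """
    rendered = Template(template_text).substitute(values)
    return json.loads(rendered)


policy = render_policy(LOG_POLICY_TEMPLATE, VALUES)
print(policy["Statement"][0]["Resource"])  # arn:aws:s3:::my-bucket/my-dl/logs/*
```

Using substitute rather than safe_substitute is deliberate here: a forgotten value fails loudly instead of leaving a literal ${...} in the policy.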

idbroker-assume-role-policy

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "sts:AssumeRole"
            ],
            "Resource": [
                "*"
            ]
        }
    ]
}
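
The definition above lets the IDBROKER_ROLE assume any role ("Resource": "*"). The table only requires that it can assume the RANGER_AUDIT_ROLE, the DATALAKE_ADMIN_ROLE, and any other role mapped in IDBroker, so a narrower variant that lists role ARNs explicitly may be preferable in some environments. The following Python sketch shows one way to generate such a variant. It is an assumption-based illustration and not the documented policy: the assume_role_policy helper is hypothetical and the account ID is made up.

```python
import json


def assume_role_policy(account_id: str, role_names: list) -> dict:
    """Build an sts:AssumeRole policy limited to the named roles."""
    resources = [f"arn:aws:iam::{account_id}:role/{name}" for name in role_names]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["sts:AssumeRole"],
                "Resource": resources,
            }
        ],
    }


# Made-up account ID; role names follow the convention used in this example setup.
narrow_policy = assume_role_policy(
    "123456789012", ["RANGER_AUDIT_ROLE", "DATALAKE_ADMIN_ROLE"]
)
print(json.dumps(narrow_policy, indent=4))
```

With this variant, every additional role that has a user or group mapping in IDBroker must be added to the list as well.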

log-policy-s3access

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor3",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:ListMultipartUploadParts",
                "s3:AbortMultipartUpload"
            ],
            "Resource": "arn:aws:s3:::${LOGS_LOCATION_BASE}/*"
        }
    ]
}

ranger-audit-policy-s3access

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "FullObjectAccessUnderAuditDir",
            "Effect": "Allow",
            "Action": [
                "s3:Get*",
                "s3:Put*"
            ],
            "Resource": "arn:aws:s3:::${STORAGE_LOCATION_BASE}/ranger/audit/*"
        },
        {
            "Sid": "LimitedAccessToDataLakeBucket",
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:ListBucketMultipartUploads",
                "s3:AbortMultipartUpload"
            ],
            "Resource": "arn:aws:s3:::${DATALAKE_BUCKET}"
        }
    ]
}

datalake-admin-policy-s3access

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor3",
            "Effect": "Allow",
            "Action": "s3:*",
            "Resource": [
                "arn:aws:s3:::${STORAGE_LOCATION_BASE}",
                "arn:aws:s3:::${STORAGE_LOCATION_BASE}/*"
            ]
        }
    ]
}

bucket-policy-s3access

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetAccountPublicAccessBlock",
                "s3:ListAllMyBuckets",
                "s3:ListJobs",
                "s3:CreateJob",
                "s3:HeadBucket"
            ],
            "Resource": "*"
        },
        {
            "Sid": "AllowListingOfDataLakeFolder",
            "Action": [
                "s3:ListBucketByTags",
                "s3:GetLifecycleConfiguration",
                "s3:GetBucketTagging",
                "s3:GetInventoryConfiguration",
                "s3:GetObjectVersionTagging",
                "s3:ListBucketVersions",
                "s3:GetBucketLogging",
                "s3:ListBucket",
                "s3:GetAccelerateConfiguration",
                "s3:GetBucketPolicy",
                "s3:GetObjectVersionTorrent",
                "s3:GetObjectAcl",
                "s3:GetEncryptionConfiguration",
                "s3:GetBucketRequestPayment",
                "s3:GetObjectVersionAcl",
                "s3:GetObjectTagging",
                "s3:GetMetricsConfiguration",
                "s3:GetBucketPublicAccessBlock",
                "s3:GetBucketPolicyStatus",
                "s3:ListBucketMultipartUploads",
                "s3:GetBucketWebsite",
                "s3:GetBucketVersioning",
                "s3:GetBucketAcl",
                "s3:GetBucketNotification",
                "s3:GetReplicationConfiguration",
                "s3:ListMultipartUploadParts",
                "s3:GetObject",
                "s3:GetObjectTorrent",
                "s3:GetBucketCORS",
                "s3:GetAnalyticsConfiguration",
                "s3:GetObjectVersionForReplication",
                "s3:GetBucketLocation",
                "s3:GetObjectVersion"
            ],
            "Effect": "Allow",
            "Resource": [
                "arn:aws:s3:::${DATALAKE_BUCKET}",
                "arn:aws:s3:::${DATALAKE_BUCKET}/*"
            ]
        }
    ]
}
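
As noted earlier, when the Logs Location Base and the Storage Location Base are in different buckets, bucket-policy-s3access has to name the additional bucket. One way to read that requirement is that the "Resource" list of the second statement gets a bucket ARN and an object ARN for each bucket. The following Python sketch generates such a list. The bucket_resources helper and the my-logs-bucket name are hypothetical.

```python
def bucket_resources(bucket_names: list) -> list:
    """Return the bucket ARN and the object ARN pattern for each bucket."""
    resources = []
    for name in bucket_names:
        resources.append(f"arn:aws:s3:::{name}")    # the bucket itself
        resources.append(f"arn:aws:s3:::{name}/*")  # every object in the bucket
    return resources


# "my-logs-bucket" is a made-up second bucket used only for illustration.
multi_bucket = bucket_resources(["my-bucket", "my-logs-bucket"])
for arn in multi_bucket:
    print(arn)
```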

dynamodb-policy

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "dynamodb:List*",
                "dynamodb:DescribeReservedCapacity*",
                "dynamodb:DescribeLimits",
                "dynamodb:DescribeTimeToLive"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "dynamodb:BatchGetItem",
                "dynamodb:BatchWriteItem",
                "dynamodb:DeleteItem",
                "dynamodb:DescribeTable",
                "dynamodb:GetItem",
                "dynamodb:PutItem",
                "dynamodb:Query",
                "dynamodb:UpdateItem",
                "dynamodb:CreateTable",
                "dynamodb:DeleteTable",
                "dynamodb:Scan",
                "dynamodb:TagResource",
                "dynamodb:UntagResource",
                "dynamodb:UpdateTable"
            ],
            "Resource": "arn:aws:dynamodb:*:*:table/${DYNAMODB_TABLE_NAME}"
        }
    ]
}

ec2-role-trust-policy

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "ec2.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

idbroker-role-trust-policy

{
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Principal": {
          "AWS": "arn:aws:iam::${AWS_ACCOUNT_ID}:role/${IDBROKER_ROLE}"
        },
        "Action": "sts:AssumeRole"
      }
    ]
}

Creating IAM resources

You can create IAM roles and policies from the IAM console on AWS or from the AWS CLI.
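
For the AWS CLI route, the order of operations for a single role can be sketched as follows. The Python script only builds and prints the commands; it does not run them. It assumes that each policy definition has been saved to a local JSON file named after the policy (a hypothetical file layout), and it uses a made-up account ID. The aws iam subcommands shown (create-policy, create-role, attach-role-policy) are standard AWS CLI commands. Note that a trust policy is not created as a separate IAM policy; it is passed to create-role as the assume-role policy document.

```python
import shlex

ACCOUNT_ID = "123456789012"  # made-up account ID, replace with your own


def create_policy_cmd(policy_name: str, path: str) -> list:
    """Command that creates a managed permissions policy from a JSON file."""
    return ["aws", "iam", "create-policy",
            "--policy-name", policy_name,
            "--policy-document", f"file://{path}"]


def create_role_cmd(role_name: str, trust_policy_path: str) -> list:
    """Command that creates a role with the given trust policy document."""
    return ["aws", "iam", "create-role",
            "--role-name", role_name,
            "--assume-role-policy-document", f"file://{trust_policy_path}"]


def attach_policy_cmd(role_name: str, policy_name: str, account_id: str) -> list:
    """Command that attaches a customer managed policy to a role."""
    return ["aws", "iam", "attach-role-policy",
            "--role-name", role_name,
            "--policy-arn", f"arn:aws:iam::{account_id}:policy/{policy_name}"]


def commands_for_log_role(account_id: str) -> list:
    """Commands for LOG_ROLE, following the role table in this section."""
    permissions = ["log-policy-s3access", "bucket-policy-s3access"]
    cmds = [create_policy_cmd(name, f"{name}.json") for name in permissions]
    cmds.append(create_role_cmd("LOG_ROLE", "ec2-role-trust-policy.json"))
    cmds += [attach_policy_cmd("LOG_ROLE", name, account_id) for name in permissions]
    return cmds


for cmd in commands_for_log_role(ACCOUNT_ID):
    print(shlex.join(cmd))
```

The same pattern applies to the other roles. Because idbroker-role-trust-policy names the IDBROKER_ROLE as its principal, it is safest to create the IDBROKER_ROLE before the RANGER_AUDIT_ROLE and the DATALAKE_ADMIN_ROLE.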

Providing the parameters in the UI

Once you’ve created the bucket and the required instance profiles, provide the information related to these resources in the Register Environment wizard as follows:

Logs Storage and Audits

Data Access