AWS resources used by CDP

The following AWS resources are used by CDP and CDP services.

AWS resources created for a CDP environment

The following AWS resources are created for each AWS environment:

Resource Description
Virtual Private Cloud (VPC) If during environment creation you select the option to have a new VPC and subnets created, they are created on your AWS account. Alternatively, you can provide your own existing VPC and subnets. In both cases, all compute resources that CDP provisions for the environment are provisioned into this specific VPC.
Identity and Access Management (IAM) The cross-account IAM policy that you provided as your credential allows CDP to obtain an access and secret key from AWS, allowing CDP to create resources for your environment and for CDP services such as Data Hub, Data Warehouse, and Machine Learning on your AWS account. A sketch of how this cross-account access is typically exercised follows this table.
CloudFormation During environment creation, a CloudFormation stack is provisioned for FreeIPA to create the required resources. This generates an AWS stack which links and describes the resources of your FreeIPA server.
Elastic Compute Cloud (EC2) During environment creation, an EC2 instance (m5.large) is provisioned for the FreeIPA server node. Furthermore, security groups with the rules specified during environment creation are provisioned to define inbound and outbound access to the instances.
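The cross-account credential described in the table above is exercised through AWS's standard role-assumption mechanism. The following minimal sketch shows how temporary credentials can be obtained by assuming a cross-account role with an external ID; the role ARN, external ID, and session name are placeholders, and CDP performs the equivalent step itself, so you do not need to run this.

```python
import boto3

# Placeholders: the cross-account role ARN and external ID configured for a
# CDP credential would go here.
ROLE_ARN = "arn:aws:iam::111111111111:role/example-cdp-cross-account-role"
EXTERNAL_ID = "example-external-id"

sts = boto3.client("sts")

# Assume the cross-account role to obtain a temporary access key, secret key,
# and session token, scoped by the policy attached to the role.
credentials = sts.assume_role(
    RoleArn=ROLE_ARN,
    RoleSessionName="example-environment-provisioning",
    ExternalId=EXTERNAL_ID,
)["Credentials"]

# A client built from the temporary credentials can only perform the actions
# that the cross-account policy allows, for example describing EC2 instances.
ec2 = boto3.client(
    "ec2",
    aws_access_key_id=credentials["AccessKeyId"],
    aws_secret_access_key=credentials["SecretAccessKey"],
    aws_session_token=credentials["SessionToken"],
)
print(len(ec2.describe_instances()["Reservations"]))
```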

In addition, the following resources are created for each Data Lake (one per environment):

Resource Description
Identity and Access Management (IAM) The cross-account IAM policy that you provided as your credential allows CDP to obtain an access and secret key from AWS, allowing CDP to create resources for the Data Lake on your AWS account.
CloudFormation A CloudFormation stack is provisioned for your Data Lake to create the required instances, disks, and RDS database. This generates an AWS stack which links and describes the resources of your Data Lake cluster.
Elastic Compute Cloud (EC2) EC2 instances with attached storage are provisioned for the Data Lake nodes. Three m5.2xlarge instances are created. Furthermore, security groups with the rules specified during environment creation are provisioned to define inbound and outbound access to the instances.
Relational Database Service (RDS) An RDS instance (db.m5.large) is provisioned for the Data Lake. This RDS instance is used for Cloudera Manager, Ranger, and Hive MetaStore.
Simple Storage Service (S3) The existing S3 bucket that you provide during environment creation for the Data Lake is used for Data Lake log storage and workload data storage.
DynamoDB S3 storage is eventually consistent, so, for example, a file listing on S3 might miss entries that were created very recently. To work around these eventual consistency issues, CDP uses S3Guard, a Hadoop extension that stores file names in a DynamoDB table. S3Guard can therefore return the expected file listings without having to rely only on querying the S3 content. One DynamoDB table is provisioned for S3Guard during Data Lake provisioning. A sketch of the related S3A client settings follows this table.
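CDP configures S3Guard automatically during Data Lake provisioning, so there is nothing to set up manually. Purely to illustrate the mechanism, the following sketch shows the standard Hadoop S3A properties that point the S3A connector at a DynamoDB metadata store; the table name and region are placeholders rather than values CDP uses.

```python
from pyspark.sql import SparkSession

# Illustration only: CDP sets the S3Guard properties for you. The table name
# and region below are placeholders.
spark = (
    SparkSession.builder
    .appName("s3guard-illustration")
    # Route S3A metadata operations through a DynamoDB-backed metadata store.
    .config("spark.hadoop.fs.s3a.metadatastore.impl",
            "org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore")
    # The DynamoDB table that S3Guard consults for consistent file listings.
    .config("spark.hadoop.fs.s3a.s3guard.ddb.table", "example-datalake-s3guard")
    .config("spark.hadoop.fs.s3a.s3guard.ddb.region", "us-west-2")
    .getOrCreate()
)

# Listings of s3a:// paths now consult the DynamoDB table as well as S3, so
# recently created objects show up consistently.
```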

AWS resources used by the Data Hub service

The following AWS resources are used by the Data Hub service:

Resource Description
Identity and Access Management (IAM) The cross-account IAM policy that you registered as your credential and selected for the environment allows CDP to obtain an access and secret key from AWS, allowing the Data Hub service to create resources on your AWS account.
CloudFormation A CloudFormation stack is created for each Data Hub cluster to create instances and disks. This generates an AWS stack which links and describes the resources of your Data Hub cluster.
Elastic Compute Cloud (EC2) An EC2 instance is created for each cluster node. The instance type varies depending on what you selected during Data Hub cluster creation. For each instance, attached storage is provisioned; the storage size and type vary depending on what you selected during cluster creation. Furthermore, security groups with the rules specified during environment creation are provisioned to define inbound and outbound access to the instances.
Relational Database Service (RDS) Data Hub connects to the Hive MetaStore database on the RDS instance provisioned for the Data Lake.
Simple Storage Service (S3) The existing S3 bucket that you provided for the Data Lake to use for workload data storage can be accessed from Data Hub clusters via the S3A connector, as illustrated in the sketch after this table.
DynamoDB Data Hub uses the DynamoDB table created during environment creation for the Data Lake to provide eventual consistency for data accessed from Data Hub clusters.
Auto Scaling Data Hub uses the Auto Scaling service for upscaling clusters.
Key Management Service (KMS) Data Hub uses KMS to encrypt your disks if you select disk encryption during Data Hub cluster creation.
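To make the S3A access described above concrete, here is a minimal sketch of reading workload data from the Data Lake bucket in a Spark job on a Data Hub cluster. The bucket name and path are placeholders; on a real cluster the S3A connector and credentials are already configured, so nothing beyond the path is needed.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("datalake-s3a-read").getOrCreate()

# Placeholder bucket and prefix: substitute the S3 bucket registered for the
# Data Lake and a path under which your workload data is stored.
path = "s3a://example-datalake-bucket/warehouse/sales"

# The S3A connector resolves the s3a:// scheme and reads the objects directly
# from the Data Lake's S3 bucket.
df = spark.read.parquet(path)
df.show(10)
```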

AWS resources used by the Data Warehouse (DW) service

The following AWS resources are used by the Data Warehouse service:

Resource Description
Identity and Access Management (IAM) The cross-account IAM policy that you registered as your credential and selected for the environment allows CDP to obtain an access and secret key from AWS, allowing the DW service to create resources on your AWS account.

Furthermore, during DW cluster provisioning, the DW service creates an IAM role that defines access to S3 and other provisioned resources. This role is then attached to the EC2 instance profile to grant pods within the Kubernetes environment access to these resources.

Certificate Manager DW creates, stores, and maintains a certificate in AWS Certificate Manager. This certificate is used to allow HTTPS connections to the external-facing endpoints (for example, for JDBC or the DAS UI). The certificate is signed by a trusted certificate authority, so external consumers and browsers can securely connect to DW services without having to deal with an untrusted CA or self-signed certificates.
CloudFormation The initial deployment of services such as the EKS cluster, the DW-specific RDS database, and S3 buckets is orchestrated through CloudFormation. This generates an AWS stack which links and describes the resources of your DW cluster.
Elastic Compute Cloud (EC2) DW uses EC2 instances as cluster nodes. Two different EC2 instance types (through two different Auto Scaling groups) are used to support shared services and compute requirements within the cluster: m5.2xlarge for always-on components, and r5d.4xlarge for compute nodes (Hive and Impala executors). Furthermore, security groups with the rules specified during environment creation are provisioned to define inbound and outbound access to the instances.
Simple Storage Service (S3) DW creates its own S3 buckets (separate from the environment’s S3 bucket(s)) for storing data and logs.
DynamoDB DW creates its own DynamoDB tables (separate from the environment’s DynamoDB table) to provide eventual consistency for data accessed from DW clusters.
Config AWS Config is used to describe the desired cluster configuration; the currently provisioned configuration is then adjusted to reflect the described configuration.
Auto Scaling DW uses the AWS AutoScaler to add or remove EC2 instances in the Kubernetes cluster. Whenever the Kubernetes cluster runs low on resources, new EC2 instances are provisioned and joined into the EKS cluster. Whenever the AutoScaler detects over-provisioning of resources, it removes and suspends EC2 instances.
Elastic File System (EFS) EFS is used as a shared file system across pods to persist data (for example, the result cache).
Elastic Load Balancing (ELB) All inbound traffic is routed through ELB towards the ingress controller of the Kubernetes cluster. The ELB is provisioned as a result of the Kubernetes ingress controller, which is the single point of entry for services running in the Kubernetes cluster.
Managed Kubernetes Service (EKS) EKS is the AWS implementation of the Kubernetes stack. All DW-deployed pods run within an EKS cluster (one per environment). A sketch of inspecting the EKS cluster and the account's Auto Scaling groups follows this table.
Key Management Service (KMS) DW encrypts data at rest in S3. This requires an encryption key to be generated and stored in KMS. The key is completely under the control of AWS and cannot be exported or otherwise extracted. The S3 buckets reference the key directly within KMS and use it to encrypt the stored data.
Relational Database Service (RDS) During cluster provisioning, DW provisions an RDS instance to be used as the backend database for metadata managed and stored by the HMS instances, represented as “DB Catalogs”. Each DB Catalog is implemented as a separate database within this single RDS instance.
Route53 DW uses Route53 for name resolution of the externally visible cluster names (URLs) to the right ingress controller of the EKS cluster.
Security Token Service (STS) STS is used to generate access tokens (based on roles) to access the resources within the environment's VPC.
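As a rough illustration of the EKS and Auto Scaling resources described above, the following sketch lists an EKS cluster's endpoint and Kubernetes version and then enumerates the account's Auto Scaling groups. The cluster name is a placeholder, and the sketch does not attempt to reproduce the exact names or tags that DW applies to the resources it creates.

```python
import boto3

# Placeholder: substitute the name of the EKS cluster backing your DW environment.
CLUSTER_NAME = "example-dw-eks-cluster"

eks = boto3.client("eks")
autoscaling = boto3.client("autoscaling")

# Control plane endpoint and Kubernetes version of the EKS cluster.
cluster = eks.describe_cluster(name=CLUSTER_NAME)["cluster"]
print(cluster["endpoint"], cluster["version"])

# Auto Scaling groups in the account; among these you would expect the
# shared-services (m5.2xlarge) and compute (r5d.4xlarge) groups that DW uses.
for group in autoscaling.describe_auto_scaling_groups()["AutoScalingGroups"]:
    print(group["AutoScalingGroupName"],
          group["MinSize"], group["DesiredCapacity"], group["MaxSize"])
```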

AWS resources used by the Machine Learning (ML) service

The following AWS resources are used by the Machine Learning service:

Resource Description
Identity and Access Management (IAM) The cross-account IAM policy that you registered as your credential and selected for an environment allows CDP to obtain an access and secret key from AWS, allowing the ML service to create resources on your AWS account.

Furthermore, ML creates additional IAM roles and policies for each cluster. These roles are then attached to the EC2 instance profile.

Amazon Elastic Block Store (EBS) ML uses EBS as block storage.
Route 53 ML uses Route53 for name resolution of the externally visible cluster names (URLs) to the right ingress controller of the cluster.
Elastic Load Balancer (ELB) ML uses Classic Load Balancers for redirecting traffic to EC2 instances.
Key Management Service (KMS) ML uses KMS to encrypt your disks if you select disk encryption.
Amazon Elastic File System (EFS) EFS is used for project file storage.
Elastic Compute Cloud (EC2) ML uses EC2 instances as cluster nodes. Three different EC2 instance types (through three different Auto Scaling groups) are used to support ML infrastructure and compute requirements within the Kubernetes cluster. Furthermore, security groups with the rules specified during environment creation are provisioned to define inbound and outbound access to the instances.
Auto Scaling ML uses the AWS AutoScaler to add or remove EC2 instances in the Kubernetes cluster. Whenever the Kubernetes cluster runs low on resources, new EC2 instances are provisioned and joined into the cluster. Whenever the AutoScaler detects over-provisioning of resources, it removes and suspends EC2 instances.
Simple Storage Service (S3) ML uses S3 as the primary store for data and logs.
DynamoDB ML uses the DynamoDB table created for the attached Data Lake to provide eventual consistency for data accessed from ML clusters.
Security Token Service (STS) STS is used to generate access tokens (based on roles) to access the resources within the environment's VPC.
Managed Kubernetes Service (EKS) EKS is the AWS implementation of the Kubernetes stack. All pods run within an EKS cluster (one per environment). A sketch of listing a few of these resources follows this table.
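As a small illustration of two of the resources listed above, the following sketch enumerates the EFS file systems and Classic Load Balancers visible in the account. It does not filter to a particular ML workspace, since the exact names and tags that ML applies are not described here; treat it only as a starting point for locating the relevant resources.

```python
import boto3

efs = boto3.client("efs")
elb = boto3.client("elb")  # Classic Load Balancers, as used by ML

# EFS file systems; ML uses EFS for project file storage.
for fs in efs.describe_file_systems()["FileSystems"]:
    print(fs["FileSystemId"], fs["SizeInBytes"]["Value"])

# Classic Load Balancers; ML uses these to redirect traffic to EC2 instances.
for lb in elb.describe_load_balancers()["LoadBalancerDescriptions"]:
    print(lb["LoadBalancerName"], lb["DNSName"])
```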