Overview of AWS resources used by CDP

The following AWS resources are used by CDP and CDP services.

AWS resources created for a CDP environment

When a CDP environment is created, a FreeIPA cluster and a Data Lake cluster are created.

The following AWS resources are created for FreeIPA (one per environment):

Resource Description
Virtual Private Cloud (VPC) If during environment creation you select to have a new VPC and subnets created, then the new VPC and subnets are created on your AWS account. Alternatively, you can provide your own existing VPC and subnets. In both cases, all the resources that CDP provisions for the environment are provisioned into this specific VPC. For example, the EC2 instances provisioned for Data Hub or Data Warehouse are provisioned into that VPC.
Identity and Access Management (IAM) The cross-account IAM policy that you provided as your credential allows CDP to obtain an access and secret key from AWS, allowing CDP to create resources for your environment and for CDP services such as Data Hub, Data Warehouse, and Machine Learning on your AWS account.
CloudFormation During environment creation, CloudFormation stack is provisioned for FreeIPA to create required resources. This generates an AWS stack which links and describes the resources of your FreeIPA server. Multi-AZ deployments do not use a CloudFormation template for VM creation. Neither autoscaling groups or launch templates are created. The cluster resources are managed individually using AWS native components (for example, EC2 instances).
Auto Scaling FreeIPA uses the Auto Scaling service for upscaling FreeIPA, except in a multi-AZ deployment.
Elastic Compute Cloud (EC2) During environment creation, two or three m5.large EC2 instances are provisioned for the FreeIPA HA server by default. The number of instances depends on the selected Data Lake type. Furthermore, security groups with the rules specified during environment creation are provisioned to define inbound and outbound access to the instances.

In addition, the following resources are created for each Data Lake (one per environment):

Resource Description
CloudFormation A CloudFormation stack is provisioned for your Data Lake to create instances, disks, and RDS required. This generates an AWS stack which links and describes the resources of your Data Lake cluster. Multi-AZ deployments do not use a CloudFormation template for VM creation. Neither autoscaling groups nor launch templates are created. In a multi-AZ setup the cluster resources are managed individually using AWS native components (for example, EC2 instances).
Auto Scaling The Data Lake uses the Auto Scaling service for upscaling clusters, except in a multi-AZ deployment.
Elastic Compute Cloud (EC2) EC2 instances with attached storage are provisioned for the Data Lake nodes:
  • Light duty: Two instances are provisioned: One t3.medium instance (IDBroker) and one m5.2xlarge instance (Data Lake Master node).
  • Medium duty: Ten instances are provisioned: Two t3.medium instances (IDBroker), three m5.xlarge instances (two Data Lake Master nodes and one Auxiliary node), and five m5.2xlarge instances (three DataLake Core nodes and two Gateway nodes).
Furthermore, security groups with the rules specified during environment creation are provisioned to define inbound and outbound access to the instances.
Relational Database Service (RDS) An RDS instance (db.m5.large) is provisioned for the Data Lake. This RDS instance is used for Cloudera Manager, Ranger, and Hive MetaStore.
Simple Storage Service (S3) The existing S3 that you provide during environment creation for the Data Lake is used for Data Lake log storage and workload data storage.

AWS resources created for Data Hub

The following AWS resources are created for the Data Hub service:

Resource Description
CloudFormation A CloudFormation stack is created for each Data Hub cluster to create instances and disks. This generates an AWS stack which links and describes the resources of your Data Hub cluster. Multi-AZ deployments do not use a CloudFormation template for VM creation. Neither autoscaling groups or launch templates are created. In a multi-AZ setup the cluster resources are managed individually using AWS native components (for example, EC2 instances).
Elastic Compute Cloud (EC2) An EC2 instance is created for each cluster node. The instance type varies depending on what you selected during Data Hub cluster creation. For each instance, attached storage is provisioned. The storage size and type varies depending on what you selected during cluster creation. Furthermore, security groups with the rules specified during environment creation are provisioned to define inbound and outbound access to the instances. For a list of supported EC2 instance types, refer to Cloudera Data Platform (CDP) Public Cloud service rates.
Relational Database Service (RDS) Data Hub connects to the Hive MetaStore database on the RDS instance provisioned for the Data Lake.
Simple Storage Service (S3) The existing S3 bucket that you provided for the Data Lake to use for workload data storage can be accessed from Data Hub clusters via the S3A connector.
Auto Scaling Data Hub uses the Auto Scaling service for upscaling clusters, except in a multi-AZ deployment.
Key Management Service (KMS) Data Hub uses KMS for encrypting your disks if during Data Hub cluster creation you select to use disk encryption.

AWS resources created for Data Engineering

The following AWS resources are created for the Cloudera Data Engineering (CDE) service:

Resource Description
CloudFormation The initial deployment of services such as the EKS cluster is orchestrated through CloudFormation. This generates an AWS stack which links and describes the resources of your CDE cluster.
Elastic Compute Cloud (EC2) CDE uses EC2 instances as cluster nodes. For a list of supported EC2 instance types, refer to Cloudera Data Platform (CDP) Public Cloud service rates.
Auto Scaling CDE uses AWS AutoScaler to add or remove EC2 instances to the kubernetes cluster. Whenever the kubernetes cluster is running low on resources, new EC2 instances are provisioned and jointed into the EKS cluster. Whenever the AutoScaler detects an over-provisioning of resources, it removes and suspends EC2 instances.
Elastic Kubernetes Service (EKS) EKS is the AWS implementation of the kubernetes stack. All PODs are running within an EKS cluster (one per environment).
ELB Classic Load Balancer CDE uses Classic Load Balancers for redirecting traffic to EC2 instances.
Key Management Service (KMS) CDE uses KMS for encrypting your disks if you select to use disk encryption.
Elastic Block Store (EBS) CDE uses EBS for persistent instance storage.
Elastic File System (EFS) CDE uses EFS for persistent service and virtual cluster storage.
Relational Database Service (RDS) CDE uses RDS for provisioning relational databases.

AWS resources created for DataFlow

The following AWS resources are created for the DataFlow (DF) service:

Resource Description
CloudFormation The initial deployment of services such as the EKS cluster is orchestrated through CloudFormation. This generates an AWS stack which links and describes the resources of your DataFlow cluster.
Elastic Compute Cloud (EC2) DataFlow uses EC2 instances as cluster nodes. For a list of supported EC2 instance types, refer to Cloudera Data Platform (CDP) Public Cloud service rates.
Auto Scaling DataFlow uses AWS AutoScaler to add or remove EC2 instances to the kubernetes cluster. Whenever the kubernetes cluster is running low on resources, new EC2 instances are provisioned and jointed into the EKS cluster. Whenever the AutoScaler detects an over-provisioning of resources, it removes and suspends EC2 instances.
Elastic Kubernetes Service (EKS) EKS is the AWS implementation of the kubernetes stack. All PODs are running within an EKS cluster (one per environment).
ELB Classic Load Balancer DataFlow uses Classic Load Balancers for redirecting traffic to EC2 instances.
Elastic Block Store (EBS) DataFlow uses EBS for persistent instance storage.
Relational Database Service (RDS) DataFlow uses RDS for provisioning relational databases.

AWS resources created for Data Warehouse

The following AWS resources are created for the Cloudera Data Warehouse (CDW) service:

Resource Description
Identity and Access Management (IAM) During CDW cluster provisioning, the DW service creates an IAM role that defines access to S3 and other provisioned resources. Such role is then attached to the EC2 instance profile to grant PODs within the kubernetes environment access to these resources.
Certificate Manager CDW creates, stores, and maintains a certificate in the AWS certificate manager. This certificate is used to allow HTTPS connections to the external facing endpoints (i.e. for JDBC or the DAS UI). The certificate is signed by a trusted certificate authority, therefore external consumers and browser can securely connect to DW services without having to deal with untrusted CA or self-signed certificates.
CloudFormation The initial deployment of services such as the EKS cluster, the CDW-specific RDS database, and S3 buckets is orchestrated through CloudFormation. This generates an AWS stack which links and describes the resources of your DW cluster.
Elastic Compute Cloud (EC2) CDW uses EC2 instances as cluster nodes. Two different EC2 instance types (through two different auto scaler groups) are used to support shared services and compute requirements within the cluster: m5.2xlarge for always on components, and r5d.4xlarge for compute nodes (Hive and Impala executors). Furthermore, security groups with the rules specified during environment creation are provisioned to define inbound and outbound access to the instances. For a list of supported EC2 instance types, refer to Cloudera Data Platform (CDP) Public Cloud service rates.
Simple Storage Service (S3) CDW creates its own S3 buckets (separate from the environment’s S3 bucket(s)) for storing data and logs.
Auto Scaling CDW uses AWS AutoScaler to add or remove EC2 instances to the kubernetes cluster. Whenever the kubernetes cluster is running low on resources, new EC2 instances are provisioned and jointed into the EKS cluster. Whenever the AutoScaler detects an over-provisioning of resources, it removes and suspends EC2 instances.
Elastic File System (EFS) EFS is used as shared filesystem across PODs to persist data (i.e. result cache).
Elastic Load Balancing (ELB) All inbound traffic is routed through ELB towards the ingress controller of the kubernetes cluster. The ELB is provisioned as a result of the kubernetes ingress controller, which is the single point of entry for services, running in the kubernetes cluster.
Managed Kubernetes Service (EKS) EKS is the AWS implementation of the kubernetes stack. All DW-deployed PODs are running within an EKS cluster (one per environment).
Key Management Service (KMS) CDW encrypts data at rest in S3. This requires an encryption key to be generated and stored in KMS. The key is completely under the control of AWS and cannot be exported or otherwise extracted. The S3 buckets are directly referencing the key within KMS, using it to encrypt the stored data.
Relational Database Service (RDS) During cluster provisioning, DW provisions an RDS instance to be used as backend database system for metadata, managed and stored by the HMS instances, represented by “DB Catalogs”. Each DB Catalog is implemented as separate database within this single RDS instance.
Security Token Service (STS) STS is used to generate access tokens (based on roles) to access the resources within the environment's VPC.

AWS resources created for Machine Learning

The following AWS resources are created for the Cloudera Machine Learning (CML) service:

Resource Description
Identity and Access Management (IAM) CML creates additional IAM roles and policies for each cluster. Such roles are then attached to the EC2 instance profile.
Elastic Block Store (EBS) CML uses EBS as block storage.
Elastic Load Balancer (ELB) CML uses Classic Load Balancers for redirecting traffic to EC2 instances.
Key Management Service (KMS) CML uses KMS for encrypting your disks if you select to use disk encryption.
Elastic File System (EFS) EFS is used for project file storage.
Elastic Compute Cloud (EC2) CML uses EC2 instances as cluster nodes. Three different EC2 instance types (through three different auto scaler groups) are used to support CML infra and compute requirements within the kubernetes cluster. Furthermore, security groups with the rules specified during environment creation are provisioned to define inbound and outbound access to the instances. For a list of supported EC2 instance types, refer to Cloudera Data Platform (CDP) Public Cloud service rates.
Auto Scaling CML uses AWS AutoScaler to add or remove EC2 instances to the kubernetes cluster. Whenever the kubernetes cluster is running low on resources, new EC2 instances are provisioned and jointed into the cluster. Whenever the AutoScaler detects an over-provisioning of resources, it removes and suspends EC2 instances.
Simple Storage Service (S3) CML uses S3 as the primary store for data and logs.
Security Token Service (STS) STS is used to generate access tokens (based on roles) to access the resources within the environment's VPC.
Managed Kubernetes Service (EKS) EKS is the AWS implementation of the kubernetes stack. All PODs are running within an EKS cluster (one per environment).

AWS resources created for Operational Database

The following AWS resources are created for the Cloudera Operational Database (COD) service:

Resource Description
CloudFormation A CloudFormation stack is created for each COD database to create instances and disks. This generates an AWS stack which links and describes the resources of your COD database.
Elastic Compute Cloud (EC2) An EC2 instance is created for each node. The instance type, storage size, and storage type is determined automatically by COD. Furthermore, security groups with the rules specified during environment creation are provisioned to define inbound and outbound access to the instances.
Simple Storage Service (S3) The existing S3 bucket that you provided for the Data Lake to use for workload data storage can be accessed from COD database via the S3A connector.
Relational Database Service (RDS) An RDS instance is provisioned for the COD. This RDS instance is used by Cloudera Manager.