Overview of AWS resources used by Cloudera

The following AWS resources are used by Cloudera and Cloudera services.

AWS resources created for a Cloudera environment

When a Cloudera environment is created, a FreeIPA cluster and a Data Lake cluster are created.

The following AWS resources are created for FreeIPA (one per environment):

Resource Description
Virtual Private Cloud (VPC) If you choose to have a new VPC and subnets created during environment creation, they are created in your AWS account. Alternatively, you can provide your own existing VPC and subnets. In both cases, all the resources that Cloudera provisions for the environment are provisioned into this VPC. For example, the EC2 instances provisioned for Cloudera Data Hub or Cloudera Data Warehouse are provisioned into that VPC.
Identity and Access Management (IAM) The cross-account IAM policy that you provided as your credential allows Cloudera to obtain an access and secret key from AWS, allowing Cloudera to create resources for your environment and for Cloudera services such as Cloudera Data Hub, Cloudera Data Warehouse, and Cloudera AI on your AWS account.
CloudFormation During environment creation, a CloudFormation stack is provisioned for FreeIPA to create the required resources. This generates an AWS stack which links and describes the resources of your FreeIPA server. Multi-AZ deployments do not use a CloudFormation template for VM creation. Neither autoscaling groups nor launch templates are created. In a multi-AZ setup, the cluster resources are managed individually using AWS native components (for example, EC2 instances).
Auto Scaling FreeIPA uses the Auto Scaling service for upscaling FreeIPA, except in a multi-AZ deployment.
Elastic Compute Cloud (EC2) During environment creation, two or three m5.large EC2 instances are provisioned for the FreeIPA HA server by default. The number of instances depends on the selected Data Lake type. Furthermore, security groups with the rules specified during environment creation are provisioned to define inbound and outbound access to the instances.
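The difference between standard and multi-AZ deployments described in the CloudFormation and Auto Scaling rows can be summarized in a small sketch. The names ProvisioningPlan and plan_for are hypothetical and exist only for illustration; they are not part of any Cloudera or AWS API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProvisioningPlan:
    """Hypothetical summary of how the cluster VMs are created and scaled."""
    uses_cloudformation_for_vms: bool
    uses_auto_scaling: bool


def plan_for(multi_az: bool) -> ProvisioningPlan:
    """Return the mechanisms the table above describes for a deployment."""
    if multi_az:
        # Multi-AZ: no CloudFormation template for VM creation and no
        # Auto Scaling; instances are managed individually as native
        # AWS components such as EC2 instances.
        return ProvisioningPlan(uses_cloudformation_for_vms=False,
                                uses_auto_scaling=False)
    # Otherwise a CloudFormation stack creates the resources and the
    # Auto Scaling service is used for upscaling.
    return ProvisioningPlan(uses_cloudformation_for_vms=True,
                            uses_auto_scaling=True)


if __name__ == "__main__":
    print("standard:", plan_for(multi_az=False))
    print("multi-AZ:", plan_for(multi_az=True))
```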

In addition, the following resources are created for each Data Lake (one per environment):

Resource Description
CloudFormation A CloudFormation stack is provisioned for your Data Lake to create the required instances, disks, and RDS instance. This generates an AWS stack which links and describes the resources of your Data Lake cluster. Multi-AZ deployments do not use a CloudFormation template for VM creation. Neither autoscaling groups nor launch templates are created. In a multi-AZ setup, the cluster resources are managed individually using AWS native components (for example, EC2 instances).
Auto Scaling The Data Lake uses the Auto Scaling service for upscaling clusters, except in a multi-AZ deployment.
Elastic Compute Cloud (EC2) EC2 instances with attached storage are provisioned for the Data Lake nodes:
  • Light duty: Two instances are provisioned: One t3.medium instance (IDBroker) and one m5.2xlarge instance (Data Lake Master node).
  • Medium duty: Ten instances are provisioned: Two t3.medium instances (IDBroker), three m5.xlarge instances (two Data Lake Master nodes and one Auxiliary node), and five m5.2xlarge instances (three Data Lake Core nodes and two Gateway nodes).
Furthermore, security groups with the rules specified during environment creation are provisioned to define inbound and outbound access to the instances.
Relational Database Service (RDS) An RDS instance (db.m5.large) is provisioned for the Data Lake. This RDS instance is used for Cloudera Manager, Ranger, and Hive MetaStore.
Simple Storage Service (S3) The existing S3 bucket that you provide during environment creation for the Data Lake is used for Data Lake log storage and workload data storage.
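As a quick check of the instance counts listed in the EC2 row, the following sketch tallies the Data Lake nodes per instance type. The counts and roles are copied from the table above; the dictionary layout and function names are assumptions made for illustration.

```python
from collections import Counter

# (instance type, role, count) for each Data Lake shape, as listed above.
DATA_LAKE_SHAPES = {
    "light_duty": [
        ("t3.medium", "IDBroker", 1),
        ("m5.2xlarge", "Master", 1),
    ],
    "medium_duty": [
        ("t3.medium", "IDBroker", 2),
        ("m5.xlarge", "Master", 2),
        ("m5.xlarge", "Auxiliary", 1),
        ("m5.2xlarge", "Core", 3),
        ("m5.2xlarge", "Gateway", 2),
    ],
}


def instances_by_type(shape: str) -> Counter:
    """Sum the node counts of one Data Lake shape per EC2 instance type."""
    totals = Counter()
    for instance_type, _role, count in DATA_LAKE_SHAPES[shape]:
        totals[instance_type] += count
    return totals


def total_instances(shape: str) -> int:
    """Total number of EC2 instances provisioned for one Data Lake shape."""
    return sum(instances_by_type(shape).values())


if __name__ == "__main__":
    for shape in DATA_LAKE_SHAPES:
        print(shape, total_instances(shape), dict(instances_by_type(shape)))
```

The totals match the table: two instances for light duty and ten for medium duty. The FreeIPA instances and the RDS instance are separate from these counts.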

AWS resources created for Cloudera Data Hub

The following AWS resources are created for the Cloudera Data Hub service:

Resource Description
CloudFormation A CloudFormation stack is created for each Cloudera Data Hub cluster to create instances and disks. This generates an AWS stack which links and describes the resources of your Cloudera Data Hub cluster. Multi-AZ deployments do not use a CloudFormation template for VM creation. Neither autoscaling groups nor launch templates are created. In a multi-AZ setup, the cluster resources are managed individually using AWS native components (for example, EC2 instances).
Elastic Compute Cloud (EC2) An EC2 instance is created for each cluster node. The instance type varies depending on what you selected during Cloudera Data Hub cluster creation. For each instance, attached storage is provisioned. The storage size and type vary depending on what you selected during cluster creation. Furthermore, security groups with the rules specified during environment creation are provisioned to define inbound and outbound access to the instances. For a list of supported EC2 instance types, refer to Cloudera Public Cloud service rates.
Relational Database Service (RDS) Cloudera Data Hub connects to the Hive MetaStore database on the RDS instance provisioned for the Data Lake.
Simple Storage Service (S3) The existing S3 bucket that you provided for the Data Lake to use for workload data storage can be accessed from Cloudera Data Hub clusters via the S3A connector.
Auto Scaling Cloudera Data Hub uses the Auto Scaling service for upscaling clusters, except in a multi-AZ deployment.
Key Management Service (KMS) Cloudera Data Hub uses KMS for encrypting your disks if you choose to use disk encryption during Cloudera Data Hub cluster creation.

AWS resources created for Cloudera Data Engineering

The following AWS resources are created for the Cloudera Data Engineering service:

Resource Description
CloudFormation The initial deployment of services such as the EKS cluster is orchestrated through CloudFormation. This generates an AWS stack which links and describes the resources of your Cloudera Data Engineering cluster.
Elastic Compute Cloud (EC2) Cloudera Data Engineering uses EC2 instances as cluster nodes. For a list of supported EC2 instance types, refer to Cloudera Public Cloud service rates.
Auto Scaling Cloudera Data Engineering uses AWS AutoScaler to add EC2 instances to or remove them from the Kubernetes cluster. Whenever the Kubernetes cluster is running low on resources, new EC2 instances are provisioned and joined to the EKS cluster. Whenever the AutoScaler detects an over-provisioning of resources, it removes and suspends EC2 instances.
Elastic Kubernetes Service (EKS) EKS is the AWS implementation of the Kubernetes stack. All pods run within an EKS cluster (one per environment).
ELB Classic Load Balancer Cloudera Data Engineering uses Classic Load Balancers for redirecting traffic to EC2 instances.
Key Management Service (KMS) Cloudera Data Engineering uses KMS for encrypting your disks if you select to use disk encryption.
Elastic Block Store (EBS) Cloudera Data Engineering uses EBS for persistent instance storage.
Elastic File System (EFS) Cloudera Data Engineering uses EFS for persistent service and virtual cluster storage.
Relational Database Service (RDS) Cloudera Data Engineering uses RDS for provisioning relational databases.
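The scale-up and scale-down behavior described in the Auto Scaling row can be pictured with a toy decision function. This is only a sketch under stated assumptions: the function name, the use of CPU as the single resource, and both thresholds are invented for illustration and do not reflect how the actual autoscaler makes its decisions.

```python
def scaling_decision(requested_cpu: float,
                     allocatable_cpu: float,
                     scale_up_at: float = 0.9,
                     scale_down_at: float = 0.5) -> str:
    """Return 'scale_up', 'scale_down', or 'hold'.

    The thresholds are illustrative assumptions, not Cloudera or AWS defaults.
    """
    if allocatable_cpu <= 0:
        # No capacity at all: nodes have to be added before anything can run.
        return "scale_up"
    utilization = requested_cpu / allocatable_cpu
    if utilization >= scale_up_at:
        # Cluster is running low on resources: provision and join new nodes.
        return "scale_up"
    if utilization <= scale_down_at:
        # Cluster is over-provisioned: remove nodes.
        return "scale_down"
    return "hold"


if __name__ == "__main__":
    for requested in (95.0, 70.0, 20.0):
        print(requested, scaling_decision(requested, allocatable_cpu=100.0))
```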

AWS resources created for Cloudera DataFlow

The following AWS resources are created for the Cloudera DataFlow service:

Resource Description
CloudFormation The initial deployment of services such as the EKS cluster is orchestrated through CloudFormation. This generates an AWS stack which links and describes the resources of your Cloudera DataFlow cluster.
Elastic Compute Cloud (EC2) Cloudera DataFlow uses EC2 instances as cluster nodes. For a list of supported EC2 instance types, refer to Cloudera Public Cloud service rates.
Auto Scaling Cloudera DataFlow uses AWS AutoScaler to add EC2 instances to or remove them from the Kubernetes cluster. Whenever the Kubernetes cluster is running low on resources, new EC2 instances are provisioned and joined to the EKS cluster. Whenever the AutoScaler detects an over-provisioning of resources, it removes and suspends EC2 instances.
Elastic Kubernetes Service (EKS) EKS is the AWS implementation of the Kubernetes stack. All pods run within an EKS cluster (one per environment).
ELB Classic Load Balancer Cloudera DataFlow uses Classic Load Balancers for redirecting traffic to EC2 instances.
Elastic Block Store (EBS) Cloudera DataFlow uses EBS for persistent instance storage.
Relational Database Service (RDS) Cloudera DataFlow uses RDS for provisioning relational databases.

AWS resources created for Cloudera Data Warehouse

The following AWS resources are created for the Cloudera Data Warehouse service:

Resource Description
Identity and Access Management (IAM) During Cloudera Data Warehouse cluster provisioning, the Cloudera Data Warehouse service creates an IAM role that defines access to S3 and other provisioned resources. This role is then attached to the EC2 instance profile to grant pods within the Kubernetes environment access to these resources.
Certificate Manager Cloudera Data Warehouse creates, stores, and maintains a certificate in AWS Certificate Manager. This certificate is used to allow HTTPS connections to the external-facing endpoints (for example, for JDBC or the DAS UI). The certificate is signed by a trusted certificate authority, so external consumers and browsers can securely connect to Cloudera Data Warehouse services without having to deal with untrusted CAs or self-signed certificates.
CloudFormation The initial deployment of services such as the EKS cluster, the Cloudera Data Warehouse-specific RDS database, and S3 buckets is orchestrated through CloudFormation. This generates an AWS stack which links and describes the resources of your Cloudera Data Warehouse cluster.
Elastic Compute Cloud (EC2) Cloudera Data Warehouse uses EC2 instances as cluster nodes. Two different EC2 instance types (through two different Auto Scaling groups) are used to support shared services and compute requirements within the cluster: m5.2xlarge for always-on components, and r5d.4xlarge for compute nodes (Hive and Impala executors). Furthermore, security groups with the rules specified during environment creation are provisioned to define inbound and outbound access to the instances. For a list of supported EC2 instance types, refer to Cloudera Public Cloud service rates.
Simple Storage Service (S3) Cloudera Data Warehouse creates its own S3 buckets (separate from the environment’s S3 bucket(s)) for storing data and logs.
Auto Scaling Cloudera Data Warehouse uses AWS AutoScaler to add EC2 instances to or remove them from the Kubernetes cluster. Whenever the Kubernetes cluster is running low on resources, new EC2 instances are provisioned and joined to the EKS cluster. Whenever the AutoScaler detects an over-provisioning of resources, it removes and suspends EC2 instances.
Elastic File System (EFS) EFS is used as a shared file system across pods to persist data (for example, the result cache).
Elastic Load Balancing (ELB) All inbound traffic is routed through ELB towards the ingress controller of the Kubernetes cluster. The ELB is provisioned as a result of the Kubernetes ingress controller, which is the single point of entry for services running in the Kubernetes cluster.
Elastic Kubernetes Service (EKS) EKS is the AWS implementation of the Kubernetes stack. All DW-deployed pods run within an EKS cluster (one per environment).
Key Management Service (KMS) Cloudera Data Warehouse encrypts data at rest in S3. This requires an encryption key to be generated and stored in KMS. The key is completely under the control of AWS and cannot be exported or otherwise extracted. The S3 buckets reference the key in KMS directly and use it to encrypt the stored data.
Relational Database Service (RDS) During cluster provisioning, Cloudera Data Warehouse provisions an RDS instance to be used as the backend database system for metadata managed and stored by the HMS instances, which are represented by “DB Catalogs”. Each DB Catalog is implemented as a separate database within this single RDS instance.
Security Token Service (STS) STS is used to generate access tokens (based on roles) to access the resources within the environment's VPC.
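To illustrate what it means for an S3 bucket to reference a KMS key, as the KMS row describes, the sketch below builds a bucket default-encryption document in the shape that the S3 PutBucketEncryption API expects for its ServerSideEncryptionConfiguration parameter. Whether Cloudera Data Warehouse configures its buckets in exactly this way is an assumption. The function name and the key ARN are placeholders, and no AWS call is made.

```python
import json


def default_encryption_config(kms_key_arn: str) -> dict:
    """Build a bucket default-encryption document that points at a KMS key."""
    return {
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    # Objects are encrypted with the referenced KMS key;
                    # the key material itself stays inside KMS.
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": kms_key_arn,
                }
            }
        ]
    }


if __name__ == "__main__":
    # Placeholder ARN for illustration only, not a real key.
    example_arn = "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID"
    print(json.dumps(default_encryption_config(example_arn), indent=2))
```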

AWS resources created for Cloudera AI

The following AWS resources are created for the Cloudera AI service:

Resource Description
Identity and Access Management (IAM) Cloudera AI creates additional IAM roles and policies for each cluster. These roles are then attached to the EC2 instance profile.
Elastic Block Store (EBS) Cloudera AI uses EBS as block storage.
Elastic Load Balancer (ELB) Cloudera AI uses Classic Load Balancers for redirecting traffic to EC2 instances.
Key Management Service (KMS) Cloudera AI uses KMS for encrypting your disks if you select to use disk encryption.
Elastic File System (EFS) EFS is used for project file storage.
Elastic Compute Cloud (EC2) Cloudera AI uses EC2 instances as cluster nodes. Three different EC2 instance types (through three different Auto Scaling groups) are used to support Cloudera AI infrastructure and compute requirements within the Kubernetes cluster. Furthermore, security groups with the rules specified during environment creation are provisioned to define inbound and outbound access to the instances. For a list of supported EC2 instance types, refer to Cloudera Public Cloud service rates.
Auto Scaling Cloudera AI uses AWS AutoScaler to add EC2 instances to or remove them from the Kubernetes cluster. Whenever the Kubernetes cluster is running low on resources, new EC2 instances are provisioned and joined to the cluster. Whenever the AutoScaler detects an over-provisioning of resources, it removes and suspends EC2 instances.
Simple Storage Service (S3) Cloudera AI uses S3 as the primary store for data and logs.
Security Token Service (STS) STS is used to generate access tokens (based on roles) to access the resources within the environment's VPC.
Elastic Kubernetes Service (EKS) EKS is the AWS implementation of the Kubernetes stack. All pods run within an EKS cluster (one per environment).

AWS resources created for Cloudera Operational Database

The following AWS resources are created for the Cloudera Operational Database service:

Resource Description
CloudFormation A CloudFormation stack is created for each Cloudera Operational Database database to create instances and disks. This generates an AWS stack which links and describes the resources of your Cloudera Operational Database database.
Elastic Compute Cloud (EC2) An EC2 instance is created for each node. The instance type, storage size, and storage type are determined automatically by Cloudera Operational Database. Furthermore, security groups with the rules specified during environment creation are provisioned to define inbound and outbound access to the instances.
Simple Storage Service (S3) The existing S3 bucket that you provided for the Data Lake to use for workload data storage can be accessed from the Cloudera Operational Database database via the S3A connector.
Relational Database Service (RDS) An RDS instance is provisioned for the Cloudera Operational Database. This RDS instance is used by Cloudera Manager.
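The tables on this page can be condensed into a small lookup structure, which makes it easy to answer questions such as which services use a given AWS resource. The sketch below copies the resource names from the tables, using the abbreviations ACM for Certificate Manager and ELB for both load balancer rows. The dictionary and function names are assumptions made for illustration.

```python
# AWS resources listed on this page, keyed by Cloudera component or service.
SERVICE_RESOURCES = {
    "FreeIPA": {"VPC", "IAM", "CloudFormation", "Auto Scaling", "EC2"},
    "Data Lake": {"CloudFormation", "Auto Scaling", "EC2", "RDS", "S3"},
    "Cloudera Data Hub": {"CloudFormation", "EC2", "RDS", "S3",
                          "Auto Scaling", "KMS"},
    "Cloudera Data Engineering": {"CloudFormation", "EC2", "Auto Scaling",
                                  "EKS", "ELB", "KMS", "EBS", "EFS", "RDS"},
    "Cloudera DataFlow": {"CloudFormation", "EC2", "Auto Scaling", "EKS",
                          "ELB", "EBS", "RDS"},
    "Cloudera Data Warehouse": {"IAM", "ACM", "CloudFormation", "EC2", "S3",
                                "Auto Scaling", "EFS", "ELB", "EKS", "KMS",
                                "RDS", "STS"},
    "Cloudera AI": {"IAM", "EBS", "ELB", "KMS", "EFS", "EC2", "Auto Scaling",
                    "S3", "STS", "EKS"},
    "Cloudera Operational Database": {"CloudFormation", "EC2", "S3", "RDS"},
}


def services_using(resource: str) -> list:
    """Return the sorted names of all entries whose table lists the resource."""
    return sorted(name for name, resources in SERVICE_RESOURCES.items()
                  if resource in resources)


def common_resources() -> set:
    """Return the resources that appear in every table on this page."""
    return set.intersection(*SERVICE_RESOURCES.values())


if __name__ == "__main__":
    print("EFS is used by:", services_using("EFS"))
    print("Listed in every table:", common_resources())
```

The lookup reflects only what the tables list. A service may rely on a resource indirectly (for example, through the environment's VPC) without it appearing in that service's table.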