Creating and managing CDP deployments

In this article, we provide an overview of best practices for deploying CDP and demonstrate how to create and manage CDP deployments through a simple yet powerful Terraform framework.

Continue reading for a high-level overview of best practices for setting up CDP with our standardized, Terraform-based CDP deployment patterns.

What is a CDP deployment

A CDP deployment is a set of CDP management services and data services, including the related cloud provider resources that exist in your AWS, Azure, or GCP account. It combines the cloud infrastructure, which may span multiple cloud providers and regions, with the CDP platform, which abstracts this underlying cloud provider infrastructure into an integrated, unified, logical data platform layer.

Each CDP deployment consists of CDP services and the underlying cloud provider resources. The cloud provider resources include the resources that you create in advance and the resources that CDP creates when provisioning CDP services. The CDP services include a CDP environment and workload services such as Data Hubs and data services (Cloudera Data Warehouse, Cloudera Machine Learning, and so on).

Before CDP can be deployed, a set of cloud provider prerequisites must be in place, including a virtual network and subnets, storage accounts, and IAM roles and policies. These cloud provider prerequisites are typically customer-managed and exist in the cloud provider account independently of CDP services. As such, they may be shared with other, non-Cloudera cloud services.

Once the cloud provider prerequisites are present, a CDP environment can be deployed in the virtual network. Once your CDP environment is up and running, your core CDP and cloud provider infrastructure is in place, and you can start creating Data Hubs and data services to run workloads. When these services are created, additional cloud provider resources such as VM instances, security groups, and load balancers are deployed in your cloud account. For each service, you can select which subnets of the underlying virtual network and which storage locations within your specified storage accounts it should use.

These three high-level deployment steps are described in the following diagram:

A CDP deployment can be performed by using the CDP web interface, the CDP CLI, or Terraform-based CDP deployment patterns. Continue reading to learn about deploying CDP using Terraform.

Terraform module for deploying CDP

The Terraform Module for CDP Prerequisites on AWS contains Terraform resource files and example variable definition files for creating the prerequisite cloud provider resources. The module also includes an Ansible playbook for deploying a CDP environment, Data Lake, and FreeIPA, and a set of supporting scripts and template files to interact with the CDP tenant and APIs. The module supports the network deployment patterns described in CDP deployment pattern definitions.

The module creates the prerequisite AWS resources that are required for deploying CDP. These resources are created using the Terraform AWS provider. They include a VPC configured with public and private subnets according to the network deployment pattern specified, Amazon S3 data and log buckets for the CDP environment, and a number of AWS IAM roles and policies to enable fine-grained permissions for access to the CDP Control Plane and AWS services. These are required for deployment and use of the CDP environment.

The deployment of the CDP environment is performed using an Ansible playbook which is invoked from Terraform using a local-exec provisioner. The playbook utilizes the cloudera.cloud Ansible collection to create the necessary CDP resources. These include a cross-account CDP credential for AWS, a CDP environment on AWS, Environment Admin and Environment User CDP groups associated with the environment, a set of CDP IDBroker mappings for AWS, and a CDP Data Lake.

In our Deploy CDP on AWS using Terraform onboarding guide, we use this module to quickly deploy CDP.

CDP deployment patterns

To simplify the task of defining and creating CDP deployments, we provide and describe a set of predefined target architectures recommended by Cloudera. These target architectures are called deployment patterns.

In Cloudera’s Terraform framework, each pattern is represented by a deployment template that allows you to quickly instantiate one of the reference deployments. The templates can be used as a starting point and modified according to your needs. You can learn more about the recommended configurations of CDP Public Cloud from the documentation of our end-to-end deployment patterns as well as our network reference architectures.

Currently, we provide templates that represent the following deployment patterns, each matching a different use case:

Private: Production-like setup fully deployed on private subnets without public IPs or direct outbound internet access. It demonstrates a possible production deployment with typical network security features enabled.

Semi-private: Production-like setup with access over the public internet to the user interfaces and API endpoints only. It serves as a reference for production deployments that do not need VPNs, jump hosts, or user-defined routing for outbound (egress) traffic.

Public: Simple setup with access over the public internet to all endpoints and with a minimal footprint. It can be used for quick testing, tutorials, and demonstrations, or simply to understand the internal workings of CDP Public Cloud. This setup is not secure enough for production, but can be used for a proof of concept.
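The choice among the three patterns can be sketched as a small decision function. This is an illustrative sketch only: the function name `choose_pattern` and its two boolean parameters are hypothetical and are not part of any Cloudera tooling. The rules simply restate the pattern descriptions above.

```python
def choose_pattern(production_like: bool, public_ui_and_api_access: bool) -> str:
    """Return the deployment pattern name that matches the stated needs.

    Hypothetical helper: it only encodes the pattern descriptions above.
    """
    if not production_like:
        # Minimal footprint with all endpoints public; testing and proof of concept only.
        return "public"
    if public_ui_and_api_access:
        # Production-like, with UIs and API endpoints reachable over the public internet.
        return "semi-private"
    # Production-like, fully deployed on private subnets without public IPs.
    return "private"


print(choose_pattern(production_like=True, public_ui_and_api_access=False))
```

In a real project, the selected name would typically be passed to the deployment template as a variable; the exact variable name depends on the module version and is not shown here.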

CDP deployment pattern definitions

Deployment patterns are predefined architectures recommended by Cloudera that simplify the task of defining and creating CDP deployments. There are many options available for deploying CDP, but as a best practice, Cloudera recommends that you use one of the following three deployment patterns: private, semi-private, or public.

These patterns are based on the identically named network reference architectures and extend them by incorporating Cloudera’s recommended configuration for deploying CDP in multiple availability zones, selecting the Data Lake scale, configuring storage access policies, and setting up fine-grained access control.

As can be expected, each of these deployment patterns brings a unique trade-off among various aspects, such as ease of setup, security provided, workloads supported, and so on. Read the following content to understand what specific networking, IAM, and storage cloud provider configurations, and CDP configurations are applied as part of the supported deployment patterns.

Cloud provider prerequisites

This section summarizes the networking, IAM, and storage cloud provider configurations that are made when CDP is deployed based on one of the deployment patterns.

Networking

| | Private | Semi-private | Public |
|---|---|---|---|
| VPC | A new VPC is provisioned in your cloud provider account. | A new VPC is provisioned in your cloud provider account. | A new VPC is provisioned in your cloud provider account. |
| Subnets | New subnets are created by CDP: 3x /24 subnets (external, public), only used to route egress traffic; 3x /19 subnets (internal, private) | New subnets are created by CDP: 3x /24 subnets (external, public); 3x /19 subnets (internal, private) | New subnets are created by CDP: 3x /16 subnets |
| Public IPs | No public IPs are created. | 5 Elastic IPs are created by CDP. | 3 Elastic IPs are created by CDP. |
| Egress traffic | One AWS Internet Gateway | One AWS Internet Gateway | One AWS Internet Gateway |
| Ingress traffic | AWS NAT Gateways (1 per public subnet) | AWS NAT Gateways (1 per public subnet) | AWS NAT Gateways (1 per subnet) |
| Security groups | 2 security groups (Ports 443, 22) | 2 security groups (Ports 443, 22) | 2 security groups (Rules set up based on user input/configuration) |
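To make the subnet sizes listed for the private and semi-private patterns concrete, the following sketch carves three /19 and three /24 subnets out of a single VPC range using Python's standard `ipaddress` module. The VPC CIDR `10.10.0.0/16` is an assumption chosen for illustration; the source does not state which range the templates use, and the actual allocation performed by the Terraform module may differ.

```python
import ipaddress

# Assumption: a hypothetical /16 VPC CIDR; the source does not state one.
vpc = ipaddress.ip_network("10.10.0.0/16")

# Split the VPC into /19 blocks and use the first three as internal (private) subnets.
blocks_19 = list(vpc.subnets(new_prefix=19))
private_subnets = blocks_19[:3]

# Carve three /24 external (public) subnets out of the next unused /19 block.
public_subnets = list(blocks_19[3].subnets(new_prefix=24))[:3]

for net in private_subnets + public_subnets:
    # num_addresses is the raw size of the range, before any cloud provider reservations.
    print(net, net.num_addresses)
```

Each /19 subnet spans 8192 addresses and each /24 spans 256, which reflects the intent of the layout: the internal subnets host the cluster nodes, while the external subnets only need room for a few network-facing resources.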

Identity and access management

| | Private | Semi-private | Public |
|---|---|---|---|
| Federated access | Cross-account policy | Cross-account policy | Cross-account policy |
| Storage access | IAM roles, policies, and instance profiles | IAM roles, policies, and instance profiles | IAM roles, policies, and instance profiles |

Storage

| | Private | Semi-private | Public |
|---|---|---|---|
| S3 buckets | 3 base locations | 3 base locations | 3 base locations |

Environment and Data Lake

This section summarizes CDP networking, security, and other configurations that are made when CDP is deployed based on one of the deployment patterns.

CDP networking setup

| | Private | Semi-private | Public |
|---|---|---|---|
| Communication with CDP Control Plane | Reverse HTTPS tunnel (CCM), no private link | Reverse HTTPS tunnel (CCM), no private link | Reverse HTTPS tunnel (CCM), no private link |
| Load balancer and node placement | 2 load balancers are placed in private subnets. All cluster nodes are placed in private subnets. | 2 load balancers are placed in the external subnets and all cluster nodes are placed in the internal subnets. | 2 load balancers and all cluster nodes are placed in public subnets. |
| Multiple availability zones | Environment and Data Lake clusters are spread across multiple availability zones. | Environment and Data Lake clusters are spread across multiple availability zones. | Environment cluster is spread across two availability zones. |
| Ports open (in the external and internal network) | Ports 22 and 443 are open by default. | Ports 22 and 443 are open by default. | Ports 22 and 443 are open by default. |

CDP security setup

| | Private | Semi-private | Public |
|---|---|---|---|
| Fine-grained storage access control (RAZ) | Enabled | Enabled | Enabled |
| SSH access to cluster hosts | Root access is possible with a customer-provided keypair. | Root access is possible with a customer-provided keypair. | Root access is possible with a customer-provided keypair. |

CDP versions and details

| | Private | Semi-private | Public |
|---|---|---|---|
| Data Lake Runtime version | Latest | Latest | Latest |
| Data Lake shape | Medium Duty | Medium Duty | Light Duty |
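The CDP-side differences between the patterns are small enough to capture in a lookup table, which can be handy when validating a planned deployment against the pattern it claims to follow. The sketch below is illustrative: the `PatternSettings` class, the `PATTERNS` mapping, and the `settings_for` function are hypothetical names, and the values are transcribed from the tables in this section rather than read from any CDP API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PatternSettings:
    """CDP-side settings for one deployment pattern (hypothetical structure)."""
    load_balancer_placement: str
    data_lake_shape: str
    raz_enabled: bool


# Values transcribed from the tables in this section.
PATTERNS = {
    "private": PatternSettings("private subnets", "Medium Duty", True),
    "semi-private": PatternSettings("external subnets", "Medium Duty", True),
    "public": PatternSettings("public subnets", "Light Duty", True),
}


def settings_for(pattern: str) -> PatternSettings:
    """Look up a pattern by name, rejecting names outside the three supported ones."""
    try:
        return PATTERNS[pattern]
    except KeyError:
        raise ValueError(f"unknown deployment pattern: {pattern!r}") from None


print(settings_for("public").data_lake_shape)
```

Rejecting unknown names explicitly keeps a typo such as `semiprivate` from silently falling back to a default pattern.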