Component description

This section includes an overview of the VNet, subnets, gateways and route tables, and security groups required for CDP Public Cloud for Azure.

VNet

An Azure Virtual Network (VNet) is needed for deploying CDP workloads into the customer’s cloud account. VNet is similar to a traditional network that you would operate in your own data center, but brings with it additional benefits of Azure's infrastructure such as scale, availability, and isolation.

Cloudera recommends that the VNet used for CDP is configured with the properties specified below:
  • The CIDR block for the VNet should be sufficiently large for supporting all the Data Hubs and data services that you intend to run. Refer to VNet and subnet planning to understand how to compute the CIDR block range.
  • In addition, you may want to ensure that the CIDR ranges assigned to the VNet do not overlap with any of your on-premise network CIDR ranges, as this may be a requirement for setting up connectivity from your on-premise network to the subnets.

Subnets

A subnet is a partition of the virtual network in which CDP workloads are launched.

It is recommended that the subnets be configured with the following properties:
  • The CIDR block for the subnet should be sufficiently large for supporting the CDP workload targeted to run inside it. Refer to VNet and subnet planning to understand how to compute the CIDR block range.
  • Several CDP data services run on Kubernetes and use Kubenet CNI plugin for networking. Azure's Kubenet CNI plugin requires that the subnet is not shared between multiple Kubernetes clusters as it adds custom routes (see Bring your own subnet and route table with kubenet). Therefore, as many subnets as the expected number of workloads need to be created.
  • In addition, you may want to ensure that the CIDR ranges assigned to the subnets do not overlap with any of your on-premise network CIDR ranges, as this may be a requirement for setting up connectivity from your on-prem network to the subnets.
  • A subnet can be associated with a Network Security Group (NSG). However, since Cloudera works with a fully private network configuration where communication is always initiated from VMs within the subnets, an NSG at subnet level is generally not useful for this configuration.

Gateways and route tables

This topic covers recommended gateway and route table configurations for CDP Public Cloud for Azure.

Connectivity from Control Plane to CDP workloads

  • As described in the Subnets section above, each CDP data service workload requires its own subnet and a non-shared route table associated with it (see Bring your own subnet and route table with kubenet).
  • As described in the Taxonomy of network architectures, nodes in the CDP workloads need to connect to the CDP Control Plane over the internet to establish a “tunnel” over which the CDP Control Plane can send instructions to the workloads.
  • Private AKS cluster is a feature that lets CDP access a Kubernetes workload cluster over a private IP address (see Enabling a private CDW environment in Azure Kubernetes Service). When it is enabled for a CDW workload, CDW requires the user to have already set up internet connectivity for the subnet (see Outbound type of userDefinedRouting).
  • CDP data services such as Datalake and CML create a public load balancer for internet connectivity.
  • If a firewall is configured, the destinations described in Azure outbound network access destinations need to be allowed for the CDP workloads to work.

Connectivity from customer on-prem to CDP workloads

  • As described in the Use cases section (see CDP Public Cloud reference network architecture for Azure>Use Cases), data consumers need to access data processing or consumption services in the CDP workloads.
  • Given that these are created with private IP addresses in private subnets, the customers need to arrange for access to these addresses from their on-prem or corporate networks in specific ways.
  • There are several possible solutions for achieving this, but one that is depicted in the architectural diagram, uses Azure VPN Gateway.
  • Each workload provides an option to connect to the workload endpoint over a public or a private IP. It's recommended that the public IPs are disabled and that users rely on the corporate network connectivity for accessing the workloads with private IPs.

Private endpoints

CDP workloads can be configured to access the Azure resources over a private IP (which is called a private endpoint) or over a public IP (which is called a service endpoint).

The private endpoint setup requires a private DNS zone which the VNet is linked to and resolves the Azure resources to private IP addresses. CDP supports a private endpoint configuration for Azure postgres only. CDP admin can choose to either create the private DNS zone and link it to the VNet as described in Bringing your own private DNS, or let the CDP create them when provided with necessary configuration described in Using CDP-managed private DNS.

CDP supports only service endpoint configuration for other Azure resources(such as Microsoft Storage). The subnets need to be enabled to support the service endpoints. See Service endpoint for Azure Postgres for detailed steps.

Security groups

During the specification of a VNet to CDP, the CDP admin specifies security groups that will be associated with all the CDP workloads launched within that VNet. These security groups will be used in allowing the incoming traffic to the hosts.

Security groups for Data Lakes

During the specification of a VNet to CDP, the CDP admin can either let CDP create the required security groups, taking a list of IP address CIDRs as input; or create them in Azure and then provide them to CDP.

When getting started with CDP, the CDP admin can let CDP create security groups, taking a list of IP address CIDRs as input. These will be used in allowing the incoming traffic to the hosts. The list of CIDR ranges should correspond to the address ranges from which the CDP workloads will be accessed. In a VPN-peered VNet, this would also include address ranges from customer’s on-prem network. This model is useful for initial testing given the ease of set up.

Alternatively, the CDP admin can create security groups on their own and select them during the setup of the VNet and other network configuration. This model is better for production workloads, as it allows for greater control in the hands of the CDP admin. However, note that the CDP admin must ensure that the rules meet the requirements described below.

For a fully private network, network security groups should be configured according to the types of access requirements needed by the different services in the workloads. The Network security groups section includes all the details of the necessary rules that need to be configured.

Note that for a fully private network, even specifying an open access here (such as 0.0.0.0/0) is restrictive because these services are deployed in a private subnet without a public IP address and hence do not have an incoming route from the Internet. However, the list of CIDR ranges may be useful to restrict which private subnets of the customer’s on-prem network can access the services.

Rules for AKS based workloads are described separately in the following section.

Additional rules for AKS-based workloads

At the time of enabling a CDP data service, the CDP admin can specify a list of CIDR ranges that will be used in allowing the incoming traffic to the workload Load Balancer.

This list of CIDR ranges should correspond to the address ranges from which the CDP data service workloads will be accessed. In a VPN peered VNet, this would also include address ranges from customer’s on-prem network. In a fully private network setup, 0.0.0.0/0 implies access only within the VNet and the peered VPN network which is still restrictive.

CDW data service currently supports fully private AKS clusters, which means the Kubernetes API server will have a private IP address. CML uses a public endpoint by default for all AKS cluster control planes. It is highly recommended to provide a list of outbound public CIDR ranges at the time of provisioning a data service to restrict access to the AKS clusters. In addition, the data service adds CDP public CIDR range to allow access between workloads and CDP Control Plane. An example configuration section for a data service looks like below:



Specific guidelines for restricting access to Kubernetes API server and workloads are detailed in Restricting access for CDP services that create their own security groups on Azure by each data service.

Within the AKS cluster, security groups are defined to facilitate AKS control plane-pod communication, inter-pod and inter-worker node communication as well as workload communication through Load Balancers.

Outbound connectivity requirements

Outbound traffic from the worker nodes is unrestricted and is targeted at other Azure services and CDP services. The comprehensive list of services that get accessed from a CDP environment can be found in Azure outbound network access destinations.

Domain names for the endpoints

The previous sections dealt with how connectivity is established to the workload infrastructure. This section deals with “addressability”.

The workloads launched by CDP contain a few services that need to be accessed by the CDP admins and data consumers. These include services such as Cloudera Manager; metadata services such as the Hive Metastore, Atlas, or Ranger; and data processing or consumption services such as Oozie server, Hue, and so on. Given the nature of the cloud infrastructure, the IP addresses for the nodes running these services may change (for example if the infrastructure is restarted or repaired). However, these should have statically addressable DNS names so that users can access them with the same names.

In order to help with this, CDP assigns DNS names to these nodes. These naming schemes have the following properties:

  • The DNS name is of the following format for each Data Lake node and the Data Lake cluster endpoint:
    <CLUSTER_NAME>-{<HOST_GROUP><i>}.<ENVIRONMENT-IDENTIFIER>.<CUSTOMER_IDENTIFIER>.cloudera.site
    An example could be:
    my-dataeng-master0.my-envir.aaaa-1234.cloudera.site

    This name has the following components:

    • The base domain is cloudera.site. This is a publicly registered DNS suffix. It is also a registered Route53 hosted zone in a Cloudera owned AWS account.
    • The <CUSTOMER_IDENTIFIER> is unique to a customer account on CDP made of alphanumeric characters and a “-” (dash).
    • The <ENVIRONMENT_IDENTIFIER> is generated based on the environment name and is truncated to 8 characters.
    • The <CLUSTER_NAME> is the cluster name given to the Data Lake or Data Hub. It is appended with a <HOST_GROUP> name such as “gateway”, “master”, “worker”, and so on, depending on the role that the node plays in the cluster. If there are more than one of these nodes playing the same role, they are appended with a serial number <i>.
  • The DNS name of the endpoints of the CDP data services is of the following format:
    • For a Virtual Warehouse in CDW, it is:
      <VIRTUAL_WAREHOUSE_NAME>.dw-<CDW_ENVIRONMENT_IDENTIFIER>.<CUSTOMER_IDENTIFIER>.cloudera.site
      • <VIRTUAL_WAREHOUSE_NAME> is the name of the Virtual Warehouse created. There could be multiple virtual warehouses for a given CDP environment.
      • <CDW_ENVIRONMENT_IDENTIFIER> is the identifier for the CDP environment.
    • For a Session Terminal in CML workspace, it is:
      <TTY_SESSION_ID>.<CML_WORKSPACE_ID>.<ENVIRONMENT_IDENTIFIER>.<CUSTOMER_IDENTIFIER>.cloudera.site
      • <TTY_SESSION_ID> is the ID of the CML Terminal Session ID.
      • <CML_WORKSPACE_ID> is the ID of the CML workspace created.
      • The <ENVIRONMENT_IDENTIFIER> is generated based on the environment name and is truncated to 8 characters. If the 8th character is a “-” (dash), then it is truncated to 7 characters instead.
    • For all the CDP data services listed above, the common portions of the DNS include:
      • The base domain is cloudera.site. This is a publicly registered DNS Suffix. It is also a registered Route53 hosted zone in a Cloudera owned AWS account.
      • The <CUSTOMER_IDENTIFIER> is unique to a customer account on CDP made of alphanumeric characters and a “-” (dash).
    • The length of the DNS name is restricted to 64 characters due to some limitations with Hue workloads.
    • These names are stored as A records in the Route53 hosted zone in the Cloudera managed CDP Control Plane AWS account. Hence, you can resolve these names from any location outside of the VPC. However, note that they would still resolve to private IP addresses and hence are constrained by the connectivity setup described in preceding sections.
    • Within a CDP environment, the DNS resolution happens differently. Every CDP environment has a DNS server that is played by a component called FreeIPA. This server is seeded with the hostnames of the nodes of all workload clusters in the environment. Every node in a Data Lake, Data Hub and data service is configured to look up the FreeIPA DNS service for name resolution within the cluster.
    • FreeIPA nodes are internal to the CDP environment both for resolving and use. They don't have public DNS records.

DNS

Azure-provided DNS is recommended for the VNet. This is required to enable private AKS clusters and private access to Postgres databases among others, as described in Private endpoints above.