Determining the CIDR range

Options for determining the CIDR range for CDP Public Cloud for AWS.

When registering an AWS environment in CDP, you are asked to select a VPC and one or more subnets. You have two options:

  1. CDP will create a new VPC and subnets.
  2. Select an existing VPC.

In both cases, use this guide to calculate and verify the limits of the VPC and subnets available in your AWS subscription to ensure that you have enough networking resources to create clusters in CDP.

Option 1: CDP creates the VPCs and subnets

If you would like CDP to create a new VPC, you will need to specify a /16 CIDR in IPv4 range that will be used to define the range of private IPs for VM instances provisioned into these subnets. The default is 10.10.0.0/16.

By default CDP creates 6 subnets (3 private and 3 public) and divides the address space as follows:

  • 3 x /19 private subnets for FreeIPA, Data Lake, Data Hub, Data Warehouse, Machine Learning, Data Engineering, DataFlow, Operational Database.
  • 3 x /24 public subnets reserved for future use.
  • CDP creates an internet gateway for your VPC and deploys a NAT gateway in each public subnet and configures it in the route table of each private subnet.
  • All the resources created in the VPC are assigned with relevant tags (like ‘kubernetes.io/role/internal-elb:1’ for private subnets and ‘kubernetes.io/role/elb:1’ for public subnets).

Option 2: Existing VPC and subnets

If you would like to use an existing VPC, the subnet requirements vary based on the services used. This section is a guide for calculating network requirements per service.

Subnets for Data Lake and Data Hub

Both Data Lake and Data Hub share the same subnet, but due to Data Lake’s dependency on RDS, CDP admins need at least two subnets.

It is recommended to use subnets of /19 CIDR. If you would like to use a smaller subnets, use the following guidelines:

  • One IP address is used for each VM.
  • Two IPs for your RDS instances.
  • One Light Duty Data Lake cluster uses 2 VMs and 1 additional VMs for FreeIPA.
  • One Medium Duty Data Lake cluster uses 10 VMs and 3 additional VMs for FreeIPA.
  • The exact number of VMs depends on the Data Hub’s cluster definition, but a typical Data Hub cluster uses a minimum of four VMs as a starting point and this number can be dynamically scaled up or down.
  • Make sure you allocate enough IPs to handle each cluster running at peak capacity

Subnets for Data Warehouse

The Data Warehouse service needs three subnets. You can choose the specific subnets that should be used when you activate Data Warehouse for an environment and if the specified subnets are shared with other CDP data services then ensure that there are sufficient IPs left for CDW.

An environment can be activated for CDW with Overlay Network or without Overlay Network, the next section outlines the benefits and the pitfalls of one choice over the other.

Using Data Warehouse with Overlay Network vs without Overlay Network

Enabling overlay networks creates two network spaces in your EKS:

  • A node network space, which derives per-node IP addresses from the VPC.
  • A Kubernetes pod network space, which derives per-pod IP addresses from the CNI plugin's own network space.

The overlay network is bridged into the node network. As a result, one IP address is required per node instead of one IP address needed per pod. It is recommended to enable overlay networks with Data Warehouse if the available IPs in your subnets is less than 1024.

Even though using an overlay network for CDW requires less number of IPs it adds more metadata to each network packet which even though are processed in the kernel can have a performance impact. Due to this extra metadata added to each network packet it may not be straightforward to debug incoming and outgoing traffic with some of the network debugging tools. To employ an overlay network for CDW, Weave net CNI (Container Network Interface) plugin is used. Since weave-net is a full-mesh network where each node is connected to every other node in the mesh, the number of nodes the overlay network can span across is limited to 200.

Data Warehouse with Overlay Networks

Use these guidelines to arrive at your desired subnet CIDR if you choose to use overlay networks. It is recommended to enable overlay networks with Data Warehouse if the available IPs in your subnets is less than 1024.

VM type No of VMs Total no of IPs addresses required
DW Shared Services - (Shared among all VWs in an environment) 3 3
Per Database Catalog (One catalog is created by default, you can create additional catalogs) 2 2
Shared Services per Virtual Warehouse (HS2, Hue, DAS, coordinators, catalog, statestored, and so on.) 3 3
Per Virtual Warehouse (XS) * 2 to 20 2 to 20
Per Virtual Warehouse (S) * 10 to 100 10 to 100
Per Virtual Warehouse (M) * 20 to 200 20 to 200
Per Virtual Warehouse (L) * 40 to 400 40 to 400
Per Virtual Warehouse (Custom)* x to 10x ( where x is the initial node count ) x to 10x ( where x is the initial node count)

* Each autoscaling activity can be treated as deploying a new Virtual Warehouse. For example, when a XS Virtual Warehouse is scaled once, it uses four VMs instead of two.

Data Warehouse without Overlay Network

Use these guidelines to arrive at your desired subnet CIDR if you choose to not to use overlay networks while activating your DW environment.

VM type No of VMs Maximum no of pods per VM Maximum number of IPs per VM (No of pods per VM +1 per VM) Maximum no of IPs addresses required
DW Shared Services - (Shared among all VWs in an environment) 3 25 26 78

Per Database Catalog

(One catalog is created by default, you can create additional catalogs)

2 25 26 52
Shared Services per Virtual Warehouse (HS2, Hue, DAS, coordinators, catalog, statestored, and so on.) 3 25 26 78
Per Virtual Warehouse (XS) * 2 to 20 10 11 22 to 220
Per Virtual Warehouse (S) * 10 to 100 10 11 110 to 1100
Per Virtual Warehouse (M) * 20 to 200 10 11 220 to 2200
Per Virtual Warehouse (L) * 40 to 400 10 11 440 to 4400
Per Virtual Warehouse (Custom)*

x to 10 x ( where x is the initial node count )

10 11 11x to 110x

* Each autoscaling activity can be treated as deploying a new Virtual Warehouse. For example, when a XS Virtual Warehouse is scaled once, it uses four VMs instead of two.

Query Isolation

If the Query Isolation feature has been enabled for a Virtual Warehouse and a query scans more than the threshold set in the hive.query.isolation.scan.size.threshold parameter, the planner runs the query in isolation. This means that an isolated standalone executor group is spawned to run the data-intensive query. The number of executors spawned to run the query does not exceed the default setting for the hive.query.isolation.max.nodes.per.query parameter, which default to 2 times the virtual warehouse template size. The number of isolated parallel queries (can go up to 400) and number of nodes per isolated queries (can go up to 400) is configurable for a given virtual warehouse. Each of these nodes can consume up to 11 IPs when overlay networks are not enabled and 1 IP if the overlay network is enabled against your environment.

Subnets for Machine Learning

CML requires at least two subnets in two different availability zones and you can choose which subnets should be used by a workspace at the time of provisioning. If the specified subnets are shared with other CDP data services then ensure that there are sufficient IPs left for CML. CML uses Calico CNI (Container Network Interface) to run ML pods in an overlay network. The formula to calculate IP Addresses per workspace is as follows:

  • Each workspace can grow up to 100 CPU worker nodes and 100 GPU workers; each node consumes 1 IP address.
  • In addition, you will need to allocate 9 IP addresses for infrastructure nodes (3 IPs for ML infra nodes, 4 for Liftie infra nodes and 2 load balancers).

Subnets for Data Engineering

CDE requires at least two subnets in two different availability zones. You choose at the time of workspace provisioning which subnets should be used. Ensure that there are adequate IPs left for CDE to provide for the maximum expected size of the cluster. CDE uses Calico CNI (Container Network Interface) to run pods in an overlay network. A /24 CIDR is recommended for CDE subnets, but for a custom range the formula to calculate IP addresses per CDE service is as follows:

  • Each CDE service can scale up to 100 compute nodes, each node consumes one IP address.
  • In addition, you need to allocate 5 IP addresses for the infrastructure nodes (1 IP for DE infra node, 4 for Liftie infra nodes) and 2 IP addresses per virtual cluster for the virtual cluster service nodes.

Subnets for Data Flow

CDF requires at least two subnets in two different availability zones. DataFlow by default configures EKS to run in private subnets, if they are available. CDF uses Calico CNI (Container Network Interface) to run pods in an overlay network. The CIDR block for the subnets must be sized appropriately in each CDF environment to accommodate the following:

  • Each DataFlow cluster can grow up to 50 autoscaling compute instances, each of which consumes 1 IP address.
  • A fixed overhead of 48 IP addresses for three instances for core DataFlow services.

Subnets for Operational Database

COD currently leverages Data Hub to deploy infrastructure in a private subnet. Clients associated with HBase REST server, Thrift Server, or the Phoenix Thin server can be proxied via the VPN gateway. If Apache HBase Java API or Apache Phoenix Thick JDBC client are used, an edge node must be configured to access the private computing resources. The formula to calculate IP addresses per COD database is as follows:

  • Each COD database defaults to a minimum of 9 nodes (1 leader, 2 master, 1 gateway, and 5 worker nodes), requiring 9 IP addresses.
  • Each COD database can autoscale the number of worker nodes given various factors/attributes. The range defaults to a minimum of 5 nodes and a maximum of 20 nodes. Users can reduce the minimum to 3 nodes, but there is currently no fixed upper limit. They must consider the behaviour of their database while providing the CIDR range to accommodate for the potential autoscaling growth, where each node takes 1 IP address.
  • In addition, configuring an edge node, if required for client applications, takes 1 IP address for 1 accessory node.

Associating additional CIDRs to a VPC

You can choose to add another CIDR to your VPC in case you are close to exhausting the available IPs in your VPC by following the necessary steps from Associate a secondary IPv4 CIDR block with your VPC. Once a new CIDR has been added to your VPC, a created environment not will pick these changes automatically. For the environment to pick up the new CIDR, follow these steps.

  1. Go to your environment page and under Summary tab click on the refresh icon in the Network card.


  2. To add any new subnets created with the new CIDR, click on the pencil icon and choose the new subnet from the dropdown.


The user can now use the new subnet for any subsequent workloads that will be created.