Option 2: Existing VPC and subnets

If you would like to use an existing VPC, the subnet requirements vary based on the services used. This section is a guide for calculating network requirements per service.

Subnets for Data Lake and Data Hub

Both Data Lake and Data Hub share the same subnet, but due to Data Lake’s dependency on RDS, CDP admins need at least two subnets.

It is recommended to use subnets of /19 CIDR. If you would like to use smaller subnets, use the following guidelines:

  • One IP address is used for each VM.
  • Two IP addresses are used for your RDS instances.
  • One Light Duty Data Lake cluster uses 2 VMs and 1 additional VM for FreeIPA.
  • One Medium Duty Data Lake cluster uses 10 VMs and 3 additional VMs for FreeIPA.
  • The exact number of VMs depends on the Data Hub’s cluster definition, but a typical Data Hub cluster uses a minimum of four VMs as a starting point and this number can be dynamically scaled up or down.
  • Make sure you allocate enough IPs to handle each cluster running at peak capacity.
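
The guidelines above can be turned into a rough sizing calculation. The following is a minimal sketch, not an official sizing tool: the function names are hypothetical, it treats all addresses as coming from a single subnet, and it assumes that AWS reserves 5 addresses in every subnet and that /28 is the smallest subnet AWS allows.

```python
# Figures taken from the guidelines above.
DATA_LAKE_VMS = {"light": 2 + 1, "medium": 10 + 3}  # cluster VMs + FreeIPA VMs
RDS_IPS = 2

# Assumptions (not stated in this guide): AWS reserves 5 addresses per
# subnet, and the smallest subnet AWS allows is /28.
AWS_RESERVED_PER_SUBNET = 5
LONGEST_AWS_PREFIX = 28


def required_ips(data_lake_duty, data_hub_peak_nodes):
    """IPs for one Data Lake plus every Data Hub cluster at peak size."""
    return DATA_LAKE_VMS[data_lake_duty] + RDS_IPS + sum(data_hub_peak_nodes)


def smallest_prefix(ip_count):
    """Longest IPv4 prefix whose subnet still holds ip_count usable IPs."""
    total = ip_count + AWS_RESERVED_PER_SUBNET
    host_bits = (total - 1).bit_length()  # ceil(log2(total)) without floats
    return min(32 - host_bits, LONGEST_AWS_PREFIX)


# Medium Duty Data Lake plus two Data Hub clusters peaking at 40 and 12 nodes.
needed = required_ips("medium", [40, 12])
print(needed, f"/{smallest_prefix(needed)}")  # prints: 67 /25
```

The Data Hub node counts are passed in at their peak values because the guidelines ask you to size for every cluster running at peak capacity.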

Subnets for Data Warehouse

The Data Warehouse service needs three subnets. You can choose the specific subnets to use when you activate Data Warehouse for an environment. If the specified subnets are shared with other experiences, ensure that there are sufficient IPs left for CDW.

An environment can be activated for CDW with or without an overlay network. The next section outlines the benefits and pitfalls of each choice.

Using Data Warehouse with Overlay Network vs without Overlay Network

Enabling overlay networks creates two network spaces in your EKS:

  • A node network space, which derives per-node IP addresses from the VPC.
  • A Kubernetes pod network space, which derives per-pod IP addresses from the CNI plugin's own network space.

The overlay network is bridged into the node network. As a result, one IP address is required per node instead of one IP address per pod. It is recommended to enable overlay networks with Data Warehouse if the number of available IPs in your subnets is less than 1024.
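
As a quick way to apply the 1024-address recommendation, the sketch below estimates subnet capacity with Python's standard ipaddress module. It rests on assumptions: the helper names are hypothetical, the threshold is applied to the combined capacity of all subnets, and AWS is assumed to reserve 5 addresses per subnet. Capacity is only an upper bound on the IPs that are actually free, because addresses already in use by other resources are not visible from the CIDR alone.

```python
import ipaddress

OVERLAY_THRESHOLD = 1024      # from the recommendation above
AWS_RESERVED_PER_SUBNET = 5   # assumption: addresses AWS reserves per subnet


def usable_ips(cidr):
    """Upper bound on the assignable addresses in one subnet."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED_PER_SUBNET


def overlay_recommended(cidrs):
    """True when the combined capacity falls below the threshold."""
    return sum(usable_ips(c) for c in cidrs) < OVERLAY_THRESHOLD


# Example CIDRs are placeholders, not values from this guide.
print(overlay_recommended(["10.0.0.0/24", "10.0.1.0/24", "10.0.2.0/24"]))
print(overlay_recommended(["10.0.0.0/19", "10.0.32.0/19", "10.0.64.0/19"]))
```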

Although an overlay network for CDW requires fewer IP addresses, it adds metadata to each network packet. This metadata is processed in the kernel, but it can still affect performance, and it can make incoming and outgoing traffic harder to debug with some network debugging tools. CDW uses the Weave Net CNI (Container Network Interface) plugin for its overlay network. Because Weave Net is a full-mesh network in which each node is connected to every other node in the mesh, the overlay network can span at most 200 nodes.

Data Warehouse with Overlay Networks

Use these guidelines to arrive at your desired subnet CIDR if you choose to use overlay networks. It is recommended to enable overlay networks with Data Warehouse if the number of available IPs in your subnets is less than 1024.

| VM type | No. of VMs | Total no. of IP addresses required |
| --- | --- | --- |
| DW Shared Services (shared among all VWs in an environment) | 3 | 3 |
| Per Database Catalog (one catalog is created by default; you can create additional catalogs) | 2 | 2 |
| Shared Services per Virtual Warehouse (HS2, Hue, DAS, coordinators, catalog, statestored, and so on) | 3 | 3 |
| Per Virtual Warehouse (XS) * | 2 to 20 | 2 to 20 |
| Per Virtual Warehouse (S) * | 10 to 100 | 10 to 100 |
| Per Virtual Warehouse (M) * | 20 to 200 | 20 to 200 |
| Per Virtual Warehouse (L) * | 40 to 400 | 40 to 400 |
| Per Virtual Warehouse (Custom) * | x to 10x (where x is the initial node count) | x to 10x (where x is the initial node count) |

* Each autoscaling activity can be treated as deploying a new Virtual Warehouse. For example, when an XS Virtual Warehouse is scaled once, it uses four VMs instead of two.
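
The overlay-network figures above can be summed per environment. The sketch below is illustrative only: the function name is hypothetical, it covers the four predefined sizes but not Custom, and it assumes one set of DW shared services per environment with one IP per VM.

```python
# Values from the overlay-network guidelines above (one IP per VM).
DW_SHARED_VMS = 3
VMS_PER_CATALOG = 2
VW_SHARED_VMS = 3
VW_NODE_RANGE = {"XS": (2, 20), "S": (10, 100), "M": (20, 200), "L": (40, 400)}


def overlay_ip_range(vw_sizes, catalogs=1):
    """Return (minimum, maximum) IPs for an environment with overlay networks."""
    low = high = DW_SHARED_VMS + VMS_PER_CATALOG * catalogs
    for size in vw_sizes:
        min_nodes, max_nodes = VW_NODE_RANGE[size]
        low += VW_SHARED_VMS + min_nodes
        high += VW_SHARED_VMS + max_nodes
    return low, high


# One default catalog, one XS and one S Virtual Warehouse.
print(overlay_ip_range(["XS", "S"]))  # prints: (23, 131)
```

Size the subnets for the upper bound, because it corresponds to every Virtual Warehouse fully scaled out.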

Data Warehouse without Overlay Network

Use these guidelines to arrive at your desired subnet CIDR if you choose not to use overlay networks when activating your DW environment.

| VM type | No. of VMs | Maximum no. of pods per VM | Maximum no. of IPs per VM (pods per VM + 1 per VM) | Maximum no. of IP addresses required |
| --- | --- | --- | --- | --- |
| DW Shared Services (shared among all VWs in an environment) | 3 | 25 | 26 | 78 |
| Per Database Catalog (one catalog is created by default; you can create additional catalogs) | 2 | 25 | 26 | 52 |
| Shared Services per Virtual Warehouse (HS2, Hue, DAS, coordinators, catalog, statestored, and so on) | 3 | 25 | 26 | 78 |
| Per Virtual Warehouse (XS) * | 2 to 20 | 10 | 11 | 22 to 220 |
| Per Virtual Warehouse (S) * | 10 to 100 | 10 | 11 | 110 to 1100 |
| Per Virtual Warehouse (M) * | 20 to 200 | 10 | 11 | 220 to 2200 |
| Per Virtual Warehouse (L) * | 40 to 400 | 10 | 11 | 440 to 4400 |
| Per Virtual Warehouse (Custom) * | x to 10x (where x is the initial node count) | 10 | 11 | 11x to 110x |

* Each autoscaling activity can be treated as deploying a new Virtual Warehouse. For example, when an XS Virtual Warehouse is scaled once, it uses four VMs instead of two.
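
Without an overlay network, every pod needs its own IP, so the same sum is weighted by the per-VM maximums listed above. The following sketch uses a hypothetical function name, covers the predefined sizes only, and assumes each VM is counted at its maximum IP usage (26 for shared-service and catalog VMs, 11 for Virtual Warehouse nodes).

```python
# Values from the guidelines above for environments without overlay networks.
IPS_PER_SHARED_VM = 26   # 25 pods + 1 for the VM itself
IPS_PER_VW_NODE = 11     # 10 pods + 1 for the VM itself
DW_SHARED_VMS = 3
VMS_PER_CATALOG = 2
VW_SHARED_VMS = 3
VW_NODE_RANGE = {"XS": (2, 20), "S": (10, 100), "M": (20, 200), "L": (40, 400)}


def non_overlay_ip_range(vw_sizes, catalogs=1):
    """Return (minimum, maximum) IPs for an environment without overlay networks."""
    base_vms = DW_SHARED_VMS + VMS_PER_CATALOG * catalogs
    low = high = base_vms * IPS_PER_SHARED_VM
    for size in vw_sizes:
        min_nodes, max_nodes = VW_NODE_RANGE[size]
        shared = VW_SHARED_VMS * IPS_PER_SHARED_VM
        low += shared + min_nodes * IPS_PER_VW_NODE
        high += shared + max_nodes * IPS_PER_VW_NODE
    return low, high


# One default catalog, one XS and one S Virtual Warehouse.
print(non_overlay_ip_range(["XS", "S"]))  # prints: (418, 1606)
```

For the same two Virtual Warehouses, the upper bound is more than ten times the overlay figure, which is the trade-off the previous section describes.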

Query Isolation

If the Query Isolation feature has been enabled for a Virtual Warehouse and a query scans more than the threshold set in the hive.query.isolation.scan.size.threshold parameter, the planner runs the query in isolation. This means that an isolated standalone executor group is spawned to run the data-intensive query. The number of executors spawned to run the query does not exceed the value of the hive.query.isolation.max.nodes.per.query parameter, which defaults to 2 times the Virtual Warehouse template size. The number of isolated parallel queries and the number of nodes per isolated query are both configurable for a given Virtual Warehouse, and each can go up to 400. Each of these nodes can consume up to 11 IPs when overlay networks are not enabled and 1 IP when the overlay network is enabled for your environment.
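
The extra addresses that Query Isolation can consume follow directly from the limits above. This is a minimal sketch with a hypothetical function name; it assumes the worst case in which every isolated query runs at the same time with its full node count.

```python
MAX_SETTING = 400                     # upper limit for both settings, per the text above
IPS_PER_NODE = {True: 1, False: 11}   # keyed by whether overlay networking is enabled


def isolation_ips(parallel_queries, nodes_per_query, overlay):
    """Worst-case IPs used by isolated executor groups in one Virtual Warehouse."""
    for value in (parallel_queries, nodes_per_query):
        if not 0 <= value <= MAX_SETTING:
            raise ValueError(f"setting must be between 0 and {MAX_SETTING}")
    return parallel_queries * nodes_per_query * IPS_PER_NODE[overlay]


# Five isolated queries of four nodes each.
print(isolation_ips(5, 4, overlay=False))  # prints: 220
print(isolation_ips(5, 4, overlay=True))   # prints: 20
```

Add this figure to the per-environment total for each Virtual Warehouse that has Query Isolation enabled.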

Subnets for Machine Learning

CML requires at least two subnets in two different availability zones. You can choose which subnets a workspace uses at the time of provisioning. If the specified subnets are shared with other experiences, ensure that there are sufficient IPs left for CML. CML uses the Calico CNI (Container Network Interface) plugin to run ML pods in an overlay network. The formula to calculate IP addresses per workspace is as follows:

  • Each workspace can grow up to 100 CPU worker nodes and 100 GPU workers; each node consumes 1 IP address.
  • In addition, you will need to allocate 9 IP addresses for infrastructure nodes (3 IPs for ML infra nodes, 4 for Liftie infra nodes and 2 load balancers).
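
The CML formula above reduces to a small calculation. The sketch below uses hypothetical names and assumes every workspace is sized for its maximum of 100 CPU and 100 GPU worker nodes unless lower peaks are passed in.

```python
# Values from the formula above.
MAX_CPU_WORKERS = 100
MAX_GPU_WORKERS = 100
INFRA_IPS = 3 + 4 + 2  # ML infra nodes + Liftie infra nodes + load balancers


def cml_workspace_ips(cpu_workers=MAX_CPU_WORKERS, gpu_workers=MAX_GPU_WORKERS):
    """IPs for one workspace: one per worker node plus the infrastructure IPs."""
    return cpu_workers + gpu_workers + INFRA_IPS


print(cml_workspace_ips())            # prints: 209
print(2 * cml_workspace_ips())        # two full-size workspaces, prints: 418
print(cml_workspace_ips(20, 0))       # CPU-only workspace capped at 20 nodes, prints: 29
```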