VNet and subnet planning
Whether you use your own VNet for Cloudera or have Cloudera create one for you, plan your network carefully: calculate and verify the limits of the VNet and subnets available in your Azure subscription to ensure that you have enough networking resources to create clusters in Cloudera.
When registering an Azure environment in Cloudera, you are asked to select a VNet and one or more subnets. You have two options:
- Cloudera will create a new VNet and subnets
- Select a VNet and subnets that you previously created
In both cases, use this guide to check that the VNet and subnets are large enough for the clusters you plan to create.
Option 1: Cloudera creates the VNet and subnets
If you would like Cloudera to create a new VNet, specify a valid IPv4 CIDR that defines the range of private IPs for VM instances provisioned into the VNet's subnets. This must be a /16 CIDR, but you can customize the IP range. The default is 10.10.0.0/16.
You cannot use the following reserved CIDR blocks for your VNet:
- 10.0.0.0/16
- 10.244.0.0/16
- 172.17.0.1/16
- 10.20.0.0/16
Cloudera will divide this address range as follows:
- 32 x /24 subnets - Recommended for Cloudera AI workbenches, Cloudera Data Engineering Services, and Cloudera DataFlow Services
- 3 x /19 subnets - Recommended for Cloudera Data Warehouse service
- 3 x /19 subnets - Recommended for Data Lake and Cloudera Data Hub
- 3 x /24 subnets - Reserved for future use
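As a quick sanity check, the arithmetic behind this layout can be sketched with Python's standard `ipaddress` module. The subnet counts and prefix lengths are copied from the list above; the variable names are illustrative only, and the sketch checks address totals, not how Cloudera actually places each subnet inside the VNet.

```python
import ipaddress

# Default VNet CIDR from the text; substitute your own /16.
vnet = ipaddress.ip_network("10.10.0.0/16")

# (number of subnets, prefix length), copied from the list above.
layout = [(32, 24), (3, 19), (3, 19), (3, 24)]

# A /p IPv4 block contains 2 ** (32 - p) addresses.
used = sum(count * 2 ** (32 - prefix) for count, prefix in layout)
spare = vnet.num_addresses - used

print(f"allocated: {used}, VNet size: {vnet.num_addresses}, unallocated: {spare}")
```

The listed subnets account for 58,112 of the 65,536 addresses in a /16, so the layout fits with room to spare.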
If you would like to have a minimal virtual network instead, you can use the guide outlined in the next option.
Option 2: Existing VNet and subnets
If you would like to use an existing VNet, the subnet requirements vary based on the services used. Below is a guide for calculating network requirements per service.
In addition, make sure that you follow these guidelines:
- You cannot use the following reserved CIDR blocks for your VNet:
  - 10.0.0.0/16
  - 10.244.0.0/16
  - 172.17.0.1/16
  - 10.20.0.0/16
- The Microsoft.Storage and Microsoft.SQL Service endpoints should be registered for all subnets that will be used by Cloudera.
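The reserved-block rule can be checked before you register an environment. The following is a minimal sketch using Python's standard `ipaddress` module; `reserved_conflicts` is a hypothetical helper, and treating any overlap with a reserved block as a conflict is an assumption that is stricter than the wording above.

```python
import ipaddress

# Reserved blocks as listed in the text. 172.17.0.1/16 has host bits set,
# so strict=False is needed to parse it as the network 172.17.0.0/16.
RESERVED = ["10.0.0.0/16", "10.244.0.0/16", "172.17.0.1/16", "10.20.0.0/16"]

def reserved_conflicts(cidr):
    """Return the reserved blocks that overlap the given CIDR (hypothetical helper)."""
    candidate = ipaddress.ip_network(cidr, strict=False)
    return [
        block for block in RESERVED
        if candidate.overlaps(ipaddress.ip_network(block, strict=False))
    ]

print(reserved_conflicts("10.10.0.0/16"))  # default VNet CIDR from the text
print(reserved_conflicts("10.20.0.0/16"))  # one of the reserved blocks
```

An empty list means the candidate VNet range does not touch any reserved block.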
Subnets for Data Lake and Cloudera Data Hub
Both Data Lake and Cloudera Data Hub share the same subnet, so only one subnet is required.
Cloudera recommends a minimum of a /24 CIDR. If you would like to use a smaller subnet, use the following guidelines:
- One IP address is used for each VM
- One Light Duty Data Lake cluster uses three VMs
- A typical Cloudera Data Hub cluster uses a minimum of four VMs as a starting point, but this number can be dynamically scaled up or down
- Make sure you allocate enough IPs to handle each cluster running at peak capacity
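The guidelines above translate into a short calculation. This sketch is illustrative only: the function names are hypothetical, and it additionally assumes (from Azure's general networking behavior, not from this guide) that Azure reserves five addresses in every subnet and does not allow subnets smaller than /29.

```python
AZURE_RESERVED_IPS = 5        # assumption: Azure reserves 5 addresses per subnet
SMALLEST_AZURE_PREFIX = 29    # assumption: Azure's smallest IPv4 subnet is /29
LIGHT_DUTY_DATA_LAKE_VMS = 3  # from the text: one IP per VM, three VMs

def smallest_prefix(required_ips):
    """Longest prefix (smallest subnet) that still holds required_ips."""
    total = required_ips + AZURE_RESERVED_IPS
    host_bits = (total - 1).bit_length()  # bits needed to address `total` hosts
    return min(32 - host_bits, SMALLEST_AZURE_PREFIX)

def data_lake_and_hub_ips(data_hub_peak_vms):
    """One Light Duty Data Lake plus each Data Hub cluster at peak capacity."""
    return LIGHT_DUTY_DATA_LAKE_VMS + sum(data_hub_peak_vms)

# Example: two Data Hub clusters that peak at 10 and 20 VMs.
needed = data_lake_and_hub_ips([10, 20])
print(needed, f"/{smallest_prefix(needed)}")
```

With these example numbers, 33 addresses are needed and a /26 is the smallest subnet that fits. Size for peak VM counts, not starting counts, since clusters can scale up.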
Subnets for Cloudera Data Warehouse
The Cloudera Data Warehouse service needs one subnet. You can choose the specific subnet used by Cloudera Data Warehouse when you activate Cloudera Data Warehouse for an environment. This subnet should not be shared with any of the other Cloudera applications.
Cloudera recommends a /20 or larger subnet, because autoscaling makes it difficult to accurately predict the size of each Virtual Warehouse (VW).
If you would like to size the subnets to a smaller CIDR, the following guidelines assume that you are activating your Cloudera Data Warehouse environment with the default settings (no overlay networks):
VM Purpose | # VMs | # pods per VM | IPs per VM (1 for the instance + 1 per pod) | Total IP addresses required |
Cloudera Data Warehouse Shared Services - (Shared among all VWs in an environment) | 3 | 30 | 31 | 93 |
Per Database Catalog (One catalog is created by default, you can create additional catalogs) | 2 | 30 | 31 | 62 |
Per Virtual Warehouse (XS) - without autoscaling* | 2 | 10 | 11 | 22 |
Per Virtual Warehouse (S) - without autoscaling* | 10 | 10 | 11 | 110 |
Per Virtual Warehouse (M) - without autoscaling* | 20 | 10 | 11 | 220 |
Per Virtual Warehouse (L) - without autoscaling* | 40 | 10 | 11 | 440 |
VM Purpose | # VMs | Total IP addresses required |
Cloudera Data Warehouse Shared Services (shared among all VWs in an environment) | 3 | 3 |
Per Database Catalog (One catalog is created by default, you can create additional catalogs) | 2 | 2 |
Per Virtual Warehouse (XS) - without autoscaling* | 2 | 2 |
Per Virtual Warehouse (S) - without autoscaling* | 10 | 10 |
Per Virtual Warehouse (M) - without autoscaling* | 20 | 20 |
Per Virtual Warehouse (L) - without autoscaling* | 40 | 40 |
* Each autoscaling activity can be treated as deploying a new Virtual Warehouse. For example, when an XS Virtual Warehouse is scaled once, it uses four VMs instead of two.
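To combine the table rows into a single estimate, the following sketch may help. The VM and per-VM address counts are copied from the tables above; `cdw_ips` and its parameters are hypothetical names, and the `one_ip_per_vm` flag simply switches to the second table, which counts one address per VM. Each scale-up is treated as one more copy of the Virtual Warehouse, as the footnote describes.

```python
# VM counts and per-VM address counts copied from the tables above.
VMS = {"shared": 3, "catalog": 2, "XS": 2, "S": 10, "M": 20, "L": 40}
IPS_PER_VM = {"shared": 31, "catalog": 31, "XS": 11, "S": 11, "M": 11, "L": 11}

def cdw_ips(warehouses, catalogs=1, one_ip_per_vm=False):
    """Estimate total IPs. warehouses is a list of (size, scale_ups) tuples."""
    def ips(kind, vm_count):
        return vm_count * (1 if one_ip_per_vm else IPS_PER_VM[kind])

    total = ips("shared", VMS["shared"])
    total += catalogs * ips("catalog", VMS["catalog"])
    for size, scale_ups in warehouses:
        # Each autoscaling activity counts as deploying one more warehouse.
        total += ips(size, VMS[size] * (1 + scale_ups))
    return total

# Example: one XS warehouse that scales up once, plus one S warehouse.
plan = [("XS", 1), ("S", 0)]
print(cdw_ips(plan))                      # first table (instance + pod IPs)
print(cdw_ips(plan, one_ip_per_vm=True))  # second table (one IP per VM)
```

For this example plan the estimate is 309 addresses using the first table and 19 using the second, which illustrates why the recommended subnet is so much larger when pods consume subnet addresses.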
Subnets for Cloudera AI
Azure Files NFS v4.1 is a managed, POSIX-compliant NFS service on Azure. The file share is used to store files for the Cloudera AI infrastructure and Cloudera AI workbenches. This is the recommended NFS service for use with Cloudera AI. You need one separate subnet delegated to the Azure Files NFS service (all workbenches in a region share this service). Cloudera recommends a /28 subnet for this purpose.
- Each workbench can grow up to 30 CPU worker nodes and 30 GPU worker nodes; each node consumes one IP address.
- In addition, you need to allocate up to 11 IP addresses (6 for infrastructure nodes and 5 for auxiliary networking usage).
For more information, see Network Planning for Cloudera AI on Azure.
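The worst-case address count for a single Cloudera AI deployment follows directly from the two bullets above. The sketch below checks whether a candidate subnet can hold it; the subnet CIDRs are hypothetical examples, `fits` is an illustrative helper, and the five addresses reserved per subnet are an assumption based on Azure's general behavior and are not stated in this guide.

```python
import ipaddress

CPU_WORKERS = 30        # from the text
GPU_WORKERS = 30        # from the text
EXTRA_IPS = 11          # from the text: 6 infrastructure nodes + 5 auxiliary
AZURE_RESERVED_IPS = 5  # assumption: Azure reserves 5 addresses per subnet

required = CPU_WORKERS + GPU_WORKERS + EXTRA_IPS

def fits(subnet_cidr, required_ips):
    """True if the subnet has enough usable addresses (hypothetical helper)."""
    subnet = ipaddress.ip_network(subnet_cidr)
    return subnet.num_addresses - AZURE_RESERVED_IPS >= required_ips

print(required)
print(fits("10.10.4.0/26", required), fits("10.10.4.0/25", required))
```

Under these assumptions the worst case is 71 addresses, which does not fit in a /26 but does fit in a /25.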
Subnets for Cloudera Data Engineering
Cloudera Data Engineering runs in the VNet registered in Cloudera as part of your Azure environment.
Each Cloudera Data Engineering service requires its own subnet. Cloudera Data Engineering on AKS uses the Kubenet CNI plugin provided by Azure. To use Kubenet CNI, create multiple smaller subnets when creating an Azure environment. Cloudera recommends partitioning the VNet into subnets that are just large enough to fit the expected maximum number of nodes in the cluster.
- Each Cloudera Data Engineering service can scale up to 100 compute nodes; each node consumes one IP address.
- In addition, you need to allocate 3 IPs for the base infra nodes and 2 IP addresses per virtual cluster for the virtual cluster service nodes.
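One way to turn these numbers into a right-sized subnet is sketched below with Python's `ipaddress` module. The function names and the example VNet are hypothetical, the five reserved addresses per subnet are an assumption about Azure rather than something stated here, and `carve_subnet` simply returns the first block of the right size without checking for subnets that already exist in the VNet.

```python
import ipaddress

AZURE_RESERVED_IPS = 5  # assumption: Azure reserves 5 addresses per subnet

def cde_ips(virtual_clusters, max_compute_nodes=100):
    """Compute nodes + 3 base infra IPs + 2 IPs per virtual cluster (from the text)."""
    return max_compute_nodes + 3 + 2 * virtual_clusters

def carve_subnet(vnet_cidr, required_ips):
    """Return the first block in the VNet that is just large enough."""
    vnet = ipaddress.ip_network(vnet_cidr)
    total = required_ips + AZURE_RESERVED_IPS
    prefix = 32 - (total - 1).bit_length()
    # subnets() raises ValueError if the block would be larger than the VNet.
    return next(vnet.subnets(new_prefix=prefix))

# Example: a service scaled to the 100-node maximum with 4 virtual clusters.
needed = cde_ips(virtual_clusters=4)
print(needed, carve_subnet("10.10.0.0/16", needed))
```

For this example, 111 addresses are needed and the result is a /25. If you expect fewer compute nodes, pass a lower `max_compute_nodes` to get a smaller subnet.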
Subnets for Cloudera DataFlow
Cloudera DataFlow runs in the VNet registered in Cloudera as part of your Azure environment.
The Cloudera DataFlow service requires its own subnet. Cloudera DataFlow on AKS uses the Kubenet CNI plugin provided by Azure. In order to use Kubenet CNI, create multiple smaller subnets when creating an Azure environment.
- Each Cloudera DataFlow service can scale up to 50 compute nodes; each node consumes one IP address.
- In addition, allocate two IPs for the base infra nodes.
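The same arithmetic applies to Cloudera DataFlow. This sketch shows how the smallest workable prefix changes with the number of compute nodes you expect to run; `dataflow_prefix` is a hypothetical helper, and the five reserved addresses per subnet are an assumption about Azure that this guide does not state.

```python
AZURE_RESERVED_IPS = 5   # assumption: Azure reserves 5 addresses per subnet
BASE_INFRA_IPS = 2       # from the text
MAX_COMPUTE_NODES = 50   # from the text

def dataflow_prefix(compute_nodes=MAX_COMPUTE_NODES):
    """Longest prefix whose subnet holds the nodes, infra IPs, and reserved IPs."""
    total = compute_nodes + BASE_INFRA_IPS + AZURE_RESERVED_IPS
    return 32 - (total - 1).bit_length()

for nodes in (10, 25, 50):
    print(nodes, f"/{dataflow_prefix(nodes)}")
```

Under these assumptions, a service that may reach the 50-node maximum needs a /26, while one capped at 25 nodes or fewer fits in a /27.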