DataFlow networking in Azure

DataFlow supports different networking options depending on how you have set up your VNet and subnets. If you want DataFlow to use specific subnets, make sure that you specify them when registering a CDP environment.

VNet and subnet requirements

When registering an Azure environment in CDP, you are asked to select a VNet and one or more subnets. DataFlow runs in the VNet registered in CDP as part of your Azure environment.

You have two options:

  • Use your existing VNet and subnets for provisioning CDP resources
  • Have CDP create a new VNet and subnets
Subnets for DataFlow
The DataFlow service requires its own subnet. DataFlow on AKS uses the Kubenet CNI plugin provided by Azure. To use Kubenet CNI, create multiple smaller subnets when creating an Azure environment.
Cloudera recommends the following:
  • Partition the VNet into subnets sized to fit the expected maximum number of nodes in the cluster.
  • Use a /24 CIDR range for these subnets. If you prefer a custom range, use the following points to determine how many IP addresses the DataFlow service needs:
    • The DataFlow service can scale up to 50 compute nodes.
    • Each node consumes one IP address.
    • Additionally, you must allocate two IP addresses for the base infra nodes.
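
The sizing points above can be checked with a short calculation. The following Python sketch is illustrative only: it assumes the figures stated in this section (up to 50 compute nodes, one IP address per node, two base infra nodes) and additionally assumes that Azure reserves five addresses in every subnet, as described in the Azure virtual network documentation. The function names and the 10.0.0.0/16 VNet range are hypothetical and not part of DataFlow or CDP.

```python
import ipaddress

# Figures taken from this section.
MAX_COMPUTE_NODES = 50   # DataFlow can scale up to 50 compute nodes
BASE_INFRA_NODES = 2     # two IP addresses for the base infra nodes

# Assumption: Azure reserves 5 addresses in each subnet.
AZURE_RESERVED_IPS = 5


def required_ips(compute_nodes=MAX_COMPUTE_NODES, infra_nodes=BASE_INFRA_NODES):
    """Node IP addresses needed, at one IP address per node."""
    return compute_nodes + infra_nodes


def smallest_prefix(needed_ips, reserved=AZURE_RESERVED_IPS):
    """Longest IPv4 prefix whose subnet still holds needed_ips plus reserved."""
    total = needed_ips + reserved
    host_bits = (total - 1).bit_length()  # integer ceil(log2(total))
    return 32 - host_bits


def subnet_fits(cidr, needed_ips, reserved=AZURE_RESERVED_IPS):
    """True if the subnet has enough addresses after the reserved ones."""
    net = ipaddress.ip_network(cidr)
    return net.num_addresses - reserved >= needed_ips


def partition_vnet(vnet_cidr, new_prefix=24):
    """Split a VNet range into equally sized, non-overlapping subnets."""
    return list(ipaddress.ip_network(vnet_cidr).subnets(new_prefix=new_prefix))


if __name__ == "__main__":
    needed = required_ips()
    print(needed)                              # 52
    print(smallest_prefix(needed))             # 26
    print(subnet_fits("10.0.1.0/24", needed))  # True
    print(subnet_fits("10.0.1.0/27", needed))  # False
    # Hypothetical VNet range, used only for illustration.
    print(partition_vnet("10.0.0.0/16")[:2])
```

Under these assumptions, a /26 is the smallest range that holds a fully scaled DataFlow service, so the recommended /24 leaves ample headroom. Treat the result as a lower bound for a custom range, not as a replacement for the /24 recommendation.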
Firewall exceptions for Azure AKS
If you need to restrict egress traffic in Azure, you must still allow access to a limited number of ports and addresses for cluster maintenance tasks, including cluster provisioning. See Control egress traffic for cluster nodes in Azure Kubernetes Service (AKS) to prepare your Azure environment for AKS deployment.
Cloudera recommends that you safelist the Azure portal URLs on your firewall or proxy server for management purposes. For more information, see Safelist the Azure portal URLs on your firewall or proxy server.
Azure load balancers in DataFlow
Azure provides a public and a private (internal) load balancer. DataFlow uses the Standard SKU for the load balancer. You can configure DataFlow to use either a private or a public load balancer to allow users to connect to flow deployments. By default, DataFlow provisions a private load balancer.

The figure shows a DataFlow deployment with an internal load balancer: