DataFlow networking in Azure

DataFlow supports different networking options depending on how you have set up your VNet and subnets. If you want DataFlow to use specific subnets, make sure that you specify them when registering a CDP environment.

Vnet and Subnet Requirements

When registering an Azure environment in CDP, you are asked to select a VNet and one or more subnets. DataFlow runs in the VNet registered in CDP as part of your Azure environment.

You have two options:

  • You use your existing VNet and subnets for provisioning CDP resources.
  • You let CDP create a new VNet and subnets.

Subnets for DataFlow

Learn about DataFlow subnet requirements within your VNet.

DataFlow runs in the VNet registered in CDP as part of your Azure environment. The DataFlow service requires its own subnet. DataFlow on AKS uses the Kubenet CNI plugin provided by Azure. In order to use Kubenet CNI, create multiple smaller subnets when creating an Azure environment.

Cloudera recommends the following:
  • Partition the VNet with subnets that are just the right size to fit the expected maximum of nodes in the cluster.
  • Use /24 CIDR for these subnets. However, if you prefer a custom range, use the following points to determine the IP addresses for the DataFlow service:
    • The DataFlow service can scale up to 50 compute nodes.
    • Each node consumes one IP address.
    • Additionally, you must allocate two IPs for the base infra nodes.

Firewall exceptions for Azure AKS

Learn about required firewall exceptions for maintenance tasks and management.

If you need to restrict egress traffic in Azure, then you must reserve a limited number of ports and addresses for cluster maintenance tasks including cluster provisioning. See Control egress traffic for cluster nodes in Azure Kubernetes Service (AKS) to prepare your Azure environment for AKS deployment.

Cloudera recommends you safelist the Azure portal URLs on your firewall or proxy server for management purposes. For more information, see Safelist the Azure portal URLs on your firewall or proxy server.

Azure load balancers in DataFlow

Learn about load balancing options in Azure environments.

Azure provides a public and a private (internal) load balancer. DataFlow uses the Standard SKU for the load balancer. You can configure DataFlow to use either private or public load balancer to allow users to connect to flow deployments. By default, DataFlow provisions a private load balancer.

The figure represents a DataFlow deployed with an internal load balacer:

User defined routing

Azure UDR (User-Defined Routing) refers to the capability provided by Microsoft Azure to define and control the flow of network traffic within a virtual network (VNet) using custom routing rules. By default, Azure virtual networks use system-defined routes to direct network traffic between subnets and to the internet. With UDR, you can override these default routes and specify your own routing preferences.

UDRs allow you to define specific routes for traffic to follow based on various criteria, such as source or destination IP address, protocol, port, or Next Hop type. This gives you granular control over how traffic flows within your Azure infrastructure. With UDR, you can create your own custom routes to control the path of traffic within your Azure virtual network. UDR is implemented using route tables, which contain a set of rules defining the paths for network traffic. Each subnet within a VNet can be associated with a specific route table. UDR allows you to specify the next hop for a route, which can be an IP address, a virtual appliance, or the default Azure Internet Gateway. UDR enables you to segment traffic within your virtual network, directing specific types of traffic through specific network appliances or services. UDR is commonly used in conjunction with Network Virtual Appliances (NVAs), such as firewalls, load balancers, or VPN gateways. By configuring UDR, you can ensure that traffic is properly routed through these appliances.

By leveraging UDR, you can design more complex networking architectures, implement security measures, optimize performance, and meet specific requirements for your Azure deployments.

You can enable UDR when you enable DataFlow for an environment.