Azure Load Balancers in Data Lakes and Data Hubs

The Azure Load Balancer is used in multiple places in CDP Data Lakes and Data Hubs. It is used as a frontend for Knox in both Data Lakes and Data Hubs, and for Oozie HA in HA Data Hubs.

The following table describes all use cases where an Azure Load Balancer is used in Data Lakes and Data Hubs running on Azure:
| CDP component | Azure Load Balancer use case |
|---|---|
| Data Lake | A load balancer is configured in front of Knox for Data Lakes of all shapes. |
| HA Data Hub | A load balancer is configured for all Data Hubs created from a default template where Knox and/or Oozie is running in HA mode. This can be overridden by setting the “enableLoadBalancer” setting in a custom template to “false”. |
| An environment with Public Endpoint Access Gateway enabled | When the Endpoint Gateway is enabled, load balancers are created in front of Knox for all Data Lakes and Data Hubs attached to the environment. |

If a Data Lake or Data Hub uses private networks (that is, the “Create Public IPs” option is disabled during environment creation and the Public Endpoint Access Gateway is not enabled), an internal load balancer is created for ingress traffic to Knox in all Data Lakes and in Knox HA Data Hubs running in that environment.

Because CDP uses a Standard SKU Azure Load Balancer, the internal load balancer does not allow public egress. To allow public egress, a secondary public load balancer is created. The public egress load balancer has only outbound rules defined, and does not handle ingress traffic.

This is illustrated in the following diagram:

If CDP Public Endpoint Access Gateway is enabled for the environment, a public load balancer is created to handle both public ingress to port 443 and public egress.
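The two network cases above can be summarized as a small decision function. This is only an illustrative sketch: the class, field, and function names are hypothetical and are not part of any CDP API, and the case where public IPs are enabled without the Public Endpoint Access Gateway is left unhandled because this section does not describe it.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class EnvironmentNetwork:
    # Hypothetical field names for this sketch; not CDP API names.
    create_public_ips: bool
    public_endpoint_access_gateway: bool


def knox_load_balancers(env: EnvironmentNetwork) -> List[Tuple[str, str]]:
    """Return (scheme, role) pairs for the load balancers described above."""
    if env.public_endpoint_access_gateway:
        # One public load balancer handles ingress on port 443 and egress.
        return [("public", "ingress:443+egress")]
    if not env.create_public_ips:
        # Private network: internal ingress plus a separate public
        # load balancer that has outbound rules only.
        return [("internal", "ingress"), ("public", "egress-only")]
    # Public IPs enabled without the gateway: not covered in this section.
    raise NotImplementedError("case not described in this section")
```

For example, `knox_load_balancers(EnvironmentNetwork(False, False))` yields the internal ingress load balancer together with the egress-only public one.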

Disable load balancers in an Azure environment

If a public egress load balancer cannot be used in an environment due to security rules or network configuration, load balancers can be disabled using the CDP CLI.

Note that disabling the load balancers has the following effects:

  • In Medium Duty Data Lakes and HA Data Hubs, the Knox API and UI endpoints will not be highly available. The backend services will still be highly available.

  • Oozie HA Data Hubs cannot be used.

Load balancers can be disabled at the environment level, or at the individual Data Lake and Data Hub level:
  • To disable load balancers at the environment level: When creating the environment, include the --no-enable-load-balancers flag. This flag can only be set during environment creation. When load balancers are disabled at the environment level, no Data Lake or Data Hub attached to that environment uses cloud load balancers, and this setting cannot be overridden at the Data Lake or Data Hub level.

  • To disable the load balancer at the Data Lake or Data Hub level: When creating the Data Lake or Data Hub, include the --no-enable-load-balancer flag.
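The precedence between the two levels, together with the Oozie HA restriction noted earlier, can be sketched as follows. This is a minimal model of the rules stated above, not CDP code: the function and parameter names are hypothetical, and each boolean simply records whether the corresponding flag was passed at creation time.

```python
def load_balancer_enabled(env_flag_disabled: bool,
                          cluster_flag_disabled: bool) -> bool:
    """Environment-level disable always wins; otherwise the
    Data Lake or Data Hub flag decides."""
    return not (env_flag_disabled or cluster_flag_disabled)


def check_data_hub(env_flag_disabled: bool,
                   cluster_flag_disabled: bool,
                   oozie_ha: bool) -> bool:
    """Return whether the Data Hub gets a load balancer, rejecting
    Oozie HA when load balancers are disabled."""
    enabled = load_balancer_enabled(env_flag_disabled, cluster_flag_disabled)
    if oozie_ha and not enabled:
        # Oozie HA Data Hubs cannot be used without load balancers.
        raise ValueError("Oozie HA requires load balancers to be enabled")
    return enabled
```

With this model, `load_balancer_enabled(True, False)` is `False`: once the environment was created with --no-enable-load-balancers, a Data Lake or Data Hub cannot turn load balancers back on.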