Azure account requirements for DataFlow

As the administrator for your Azure environment, ensure that the environment meets the requirements listed in this topic to enable the Cloudera DataFlow experience in CDP Public Cloud.

The following steps outline how to ensure that your Azure environment meets the CDP and DataFlow requirements:

1. Review the Azure account prerequisites for CDP

Review the Azure subscription requirements, and make sure that the Azure account you are using for CDP has the required resources and that you have the permissions required to manage those resources.
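As a quick sanity check, you can confirm which subscription the Azure CLI is targeting before you begin. This is a minimal sketch that assumes the Azure CLI is installed and you are logged in; <subscription-id> is a placeholder:

    # Show the subscription the Azure CLI is currently targeting.
    $ az account show --output table
    # Switch subscriptions if needed.
    $ az account set --subscription <subscription-id>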

2. Specific networking requirements

VNet and subnet requirements

When registering an Azure environment in CDP, you will be asked to select a VNet and one or more subnets. DataFlow runs in the VNet registered in CDP as part of your Azure environment.

You have two options:

  • Use your existing VNet and subnets for provisioning CDP resources
  • Have CDP create a new VNet and subnets
Subnets for DataFlow
The DataFlow service requires its own subnet in the VNet registered as part of your Azure environment. DataFlow on AKS uses the kubenet CNI plugin provided by Azure. To use kubenet CNI, create multiple smaller subnets when creating an Azure environment.
Cloudera recommends the following:
  • Partition the VNet into subnets sized to fit the expected maximum number of nodes in the cluster.
  • Use a /24 CIDR for these subnets. If you prefer a custom range, use the following points to determine the IP addresses that the DataFlow service needs (see the sizing example after this list):
    • The DataFlow service can scale up to 50 compute nodes.
    • Each node consumes one IP address.
    • Additionally, you must allocate two IPs for the base infra nodes.
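In practice, this means a subnet needs at least 52 addresses (50 compute nodes plus 2 base infra nodes), and Azure itself reserves 5 addresses in every subnet, so a /24 (251 usable addresses) leaves comfortable headroom. As an illustration, a /24 subnet could be added to an existing VNet with the Azure CLI; the resource group, VNet name, subnet name, and address range below are placeholders:

    # Placeholder names and CIDR; substitute values from your environment.
    $ az network vnet subnet create \
        --resource-group cdp-rg \
        --vnet-name cdp-vnet \
        --name cdf-subnet \
        --address-prefixes 10.10.1.0/24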
Firewall exceptions for Azure AKS
If you need to restrict egress traffic in Azure, you must allow a limited set of ports and addresses for cluster maintenance tasks, including cluster provisioning. See Control egress traffic for cluster nodes in Azure Kubernetes Service (AKS) to prepare your Azure environment for AKS deployment.
Cloudera recommends that you safelist the Azure portal URLs on your firewall or proxy server for management purposes. For more information, see Safelist the Azure portal URLs on your firewall or proxy server.
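As an example, if egress is routed through Azure Firewall, an application rule can safelist outbound FQDNs. This is a sketch only: the firewall name, rule collection, and source range are assumptions, and the FQDNs shown are a small sample of the list in the AKS egress documentation, which varies by AKS version; take the authoritative list from the Microsoft documentation linked above. The az network firewall commands may also require the azure-firewall CLI extension.

    # Hypothetical firewall, collection, and source range; the FQDNs are
    # examples only -- use the full list from the AKS egress documentation.
    $ az network firewall application-rule create \
        --resource-group cdp-rg \
        --firewall-name cdp-firewall \
        --collection-name aks-egress \
        --name required-fqdns \
        --priority 100 \
        --action Allow \
        --protocols Https=443 \
        --source-addresses 10.10.1.0/24 \
        --target-fqdns mcr.microsoft.com management.azure.com login.microsoftonline.com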

3. Use only app-based credentials

For the DataFlow service, you must use an app-based credential, which requires the Contributor role in order to create a new service principal. For more information about creating an app-based credential for the environment that you want to use for the DataFlow service, see Create an app-based credential. If you need to change your environment's credential, see Change environment's credential. Both of these references are in the Management Console documentation.
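As a rough sketch, a service principal suitable for an app-based credential can be created with the Azure CLI. The application name is a placeholder, and <subscription-id> must be replaced with your subscription ID:

    # Placeholder app name and subscription ID; creates a service principal
    # with the Contributor role scoped to the subscription.
    $ az ad sp create-for-rbac \
        --name cdp-dataflow-credential \
        --role Contributor \
        --scopes /subscriptions/<subscription-id>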

4. App must have the Contributor role at the subscription level

For environments that you plan to use with the DataFlow service, ensure that the application you create in Azure has the built-in Contributor role at the Azure subscription level. For more information, see the description of app-based credentials in Credential options on Azure.
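To verify the assignment, you can list the application's role assignments with the Azure CLI; <app-id> is a placeholder for the application (client) ID:

    # List Contributor assignments for the service principal (placeholder ID).
    $ az role assignment list \
        --assignee <app-id> \
        --query "[?roleDefinitionName=='Contributor']"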

5. Created Azure app must have access to the storage account used during environment registration

Ensure that the application to which the Azure app-based credential is attached has access to the ADLS Gen2 storage location specified when you register the Azure environment. This is the storage location specified in Step 6 of the Register an Azure environment topic. Also see ADLS Gen2 and managed identities for information about storage accounts for Azure environments, and Minimal setup for cloud storage for further details. These references are in the Management Console documentation.
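As one illustrative way to grant that access, you could assign the application the Storage Blob Data Contributor role on the storage account. The IDs below are placeholders, and the exact roles you need depend on your setup; follow Minimal setup for cloud storage for the authoritative assignments:

    # Placeholder IDs; grants the app blob read/write access on the account.
    $ az role assignment create \
        --assignee <app-id> \
        --role "Storage Blob Data Contributor" \
        --scope "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>"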

6. List of required resources for Azure environments

Azure environments used for the DataFlow service must have the required resources available in the specific Azure region where the environment is registered. Currently, there is no cross-regional support for the DataFlow service.

7. Azure subscription should be in a similar region as the resources

Ensure that your Azure subscription is in the same geographic area as the region where your resources are deployed, and in particular that both are governed by the same regulatory requirements. For more information, see Azure region requirements in the Management Console documentation. That topic specifies that "CDP requires that the ADLS Gen2 storage location provided during environment registration must be in the same region as the region selected for the environment." In addition, review Azure geographies in the Microsoft documentation.
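To review which regions are available to your subscription while matching regions, you can list them with the Azure CLI:

    # List the regions available to the current subscription.
    $ az account list-locations --output table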

8. Register an Azure Environment in CDP

Once you have met the cloud provider requirements and created the Azure provisioning credential for CDP, you can proceed to register an Azure environment.

Instructions: Register an Azure environment

9. DataFlow Role Requirements

There are two CDP user roles associated with the DataFlow service: DFAdmin and DFUser. A CDP user with the EnvironmentAdmin (or higher) access level must assign these roles to users who require access to the DataFlow console within their environment.
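As a sketch of how this can be done outside the UI, the CDP CLI provides an iam assign-user-resource-role command; the CRNs below are placeholders, and you should verify the command and flag names against your CDP CLI version:

    # Placeholder CRNs; assigns a DataFlow resource role to a user for an
    # environment. Verify flags against your CDP CLI version.
    $ cdp iam assign-user-resource-role \
        --user <user-crn> \
        --resource-role-crn <dfadmin-or-dfuser-role-crn> \
        --resource-crn <environment-crn>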

10. Set up to run kubectl

  1. From the DataFlow UI, click Actions.
  2. Select Download the Kube Config and save the file. For example, ~/.kube/cdf-env1-kube-config.
  3. Run the following shell command:
    $ export KUBECONFIG=~/.kube/cdf-env1-kube-config
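  4. Verify that kubectl can reach the cluster; this assumes kubectl is installed and the downloaded kubeconfig is valid:
    # Confirm connectivity to the DataFlow cluster.
    $ kubectl get nodes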