Enabling DataFlow for an environment

Enabling DataFlow for an environment is your first step in getting started with Cloudera DataFlow. To do this, ensure that you have met the prerequisites, and then launch the Enable Environment window to walk you through the process.

  • You have an AWS or an Azure account.

  • You have prepared your infrastructure and network. For more information on requirements, see the AWS Resource Planning or the Azure Resource Planning documentation respectively.

  • You have created and registered a CDP environment. For more information, see the Data Hub documentation.

  • FreeIPA is running and healthy.

  • Your Data Lake is started and healthy.

  • You have the DFAdmin role for the environment you want to enable. For more information, see the DataFlow Security documentation

Steps

  1. From the CDP Public Cloud home page, click Cloudera DataFlow, then click Environments.
  2. Find the environment you want to enable, and click Enable to launch the Environments / Enable DataFlow window.

    If the Enable button is grayed out, hover over the Not Enabled icon for more details about the problem.

  3. Configure DataFlow Capacity.

    This defines the minimum and maximum size of the Kubernetes (K8s) cluster. Your DataFlow cluster automatically scales between the minimum and maximum cluster size that you specify here.

  4. Configure Networking.
    1. Specify whether to use a Public Endpoint.

      Select this option when you want to allow users to connect to workload side UIs like the DataFlow Deployment Manager or the actual NiFi UI through the public Internet.

      • If checked when enabling an AWS environment, this option provisions an endpoint (load balancer) in a public subnet.

        If checked when enabling an Azure environment, this option provisions an endpoint (load balancer) in a public subnet and you cannot explicitly specify subnets. The Load Balancer Subnet Use option is not available.

      • If unchecked, CDP creates an endpoint in a private subnet, and you must set up access to the endpoint manually in your cloud account to allow user access to workload side UIs.

    2. Specify Load Balancer Subnets.

      Select from Available Subnets. Explicitly specifying subnets overrides the automatic, tag-based subnet selection process and ensures load balancer provisioning in the specified subnets. If no subnets are specified, DataFlow provisions a load balancer according to how the subnets have been tagged. For more information, see VPC and subnets in the Related information section below.

    3. Specify Worker Node Subnets.

      Select one of the Available Subnets. If you do not make a subnet selection, DataFlow considers any available subnet that has been registered with the environment for worker placement. Worker nodes are only placed in public subnets if no private subnets are available.

    4. Specify Load Balancer Endpoint Access.

      Specify a set of IP address ranges that will be allowed to access the DataFlow load balancer endpoint. Providing no IP address ranges makes the load balancer endpoint open to all traffic.

  5. Configure Kubernetes.
    1. Configure API Server Endpoint Access.

      CDP environments with Cluster Connectivity Manager (CCM) enabled support the ability to create a fully private cluster, which disallows all access to the Kubernetes API Server Endpoint.

      • Specify whether to use a Private Cluster.

        If you select Private Cluster IP based access is not applicable, and the option to specify a set of IP address ranges is not available.

        If you do not select Private Cluster you can specify a set of IP address ranges that are allowed to access the Kubernetes API Server Endpoint. Providing no Classless Inter-Domain Routing (CIDR) makes the Kubernetes API Server Endpoint open to all traffic.

        In either case, any user who needs access to the Kubernetes API Server must be granted remote access to the underlying Kubernetes cluster. This can be configured after DataFlow has been enabled successfully.

    2. Configure Pod CIDR Range.

      The CIDR notation IP range from which to assign IPs to pods in your kubernetes cluster. This address should be a large address space that is not in use elsewhere in your network environment, accommodating the number of nodes that you expect to scale up to. The default value is 10.244.0.0/16. The value is used to assign a /24 address space to each node in the cluster. For example, the first node would be assigned 10.244.0.0/24, the second node 10.244.1.0/24, the third node 10.244.2.0/24. As the cluster scales or upgrades, the platform continues to assign a pod IP address range to each new node.

    3. Configure Service CIDR Range.

      The CIDR notation IP range from which to assign IPs to internal services in your kubernetes cluster. This IP address range should be an address space that is not in use elsewhere in your network environment, accommodating the amount of kubernetes services that you intend to use. The default value is 10.0.0.0/16.

  6. Configure Tags as keys and their values.

    Tags are added to DataFlow resources at the time of enablement. These tags are included in addition to those set by the CDP service.

  7. Click Enable. This may take up to 45 minutes.

Result

Your cluster status changes from Not Enabled to Enabling.

  • Hover over Enabling for environment enablement event messages to display.

  • Click the Alerts tab to see environment enablement event messages.

  • Click anywhere in your environment row to see your environment details.

Before you begin

  • You have installed the CDP CLI.
  • You have run cdp environments list-environments # to obtain the environment-crn.

Steps

  1. To enable DataFlow for an environment, enter:
    
    cdp df enable-service 
    --environment-crn [***ENVIRONMENT_CRN***] 
    --min-k8s-node-count [***MIN_K8S_NODE_COUNT***]
    --max-k8s-node-count [***MAX_K8S_NODE_COUNT***] 
    [--use-public-load-balancer] [--no-use-public-load-balancer] 
    [--private-cluster] [--no-private-cluster] 
    [--kube-api-authorized-ip-ranges [[***KUBE_API_AUTHORIZED_IP_RANGES***[ [[***KUBE_API_AUTHORIZED_IP_RANGES***[ ...]]]
    [--tags [***TAGS***]]
    [--load-balancer-authorized-ip-ranges [[***LOAD_BALANCER_AUTHORIZED_IP_RANGES***] [[***LOAD_BALANCER_AUTHORIZED_IP_RANGES***] ...]]]
    [--cluster-subnets [[***CLUSTER_SUBNETS***] [[***CLUSTER_SUBNETS***] ...]]]
    [--load-balancer-subnets [[***LOAD_BALANCER_SUBNETS***] [[***LOAD_BALANCER_SUBNETS***] ...]]]
    

    Where:

    • --environment-crn specifies the environment-crn you obtained while completing the pre-requisites.
    • --min-k8s-node-count and --max-k8s-node-count specify your DataFlow capacity. Your DataFlow cluster automatically scales between the minimum and maximum cluster size that you specify here.
    • [--use-public-load-balancer] [--no-use-public-load-balancer]
    • [--private-cluster] [--no-private-cluster]
    • --kube-api-authorized-ip-ranges specifies a set of IP address ranges that will be allowed to access the Kubernetes API Server Endpoint. Providing no IP address ranges makes the Kubernetes API Server Endpoint open to all traffic.
    • --tags specifies any tags you want to add to DataFlow resources at the time of enablement. These tags are in addition to those set by the CDP service.
    • --load-balancer-authorized-ip-ranges
    • --cluster-subnets
    • --load-balancer-subnets

Example

Successfully enabling DataFlow for an AWS environment results in output similar to:

{
  "service": {
    "kubeApiAuthorizedIpRanges": [],
    "validActions": [
      "ENABLE"
    ],
    "loadBalancerAuthorizedIpRanges": [],
    "clusterSubnets": [],
    "loadBalancerSubnets": [],
    "tags": {},
    "crn": "crn:cdp:df:us-west-1:CLOUDERA:service:96827a6b-bd8b-42fb-85ad-bdb4fc7ca43d",
    "environmentCrn": "crn:cdp:environments:us-west-1:CLOUDERA:environment:dev-east-local-k8s",
    "name": "dev-east-local-k8s",
    "cloudPlatform": "LOCAL_K8S",
    "region": "local-k8s",
    "deploymentCount": 0,
    "minK8sNodeCount": 3,
    "maxK8sNodeCount": 4,
    "status": {
      "state": "ENABLING",
      "message": "Enabling DataFlow crn:cdp:environments:us-west-1:CLOUDERA:environment:dev-east-local-k8s",
      "detailedState": "NEW"
    },
    "runningK8sNodeCount": 0,
    "instanceType": "m5.xlarge",
    "dfLocalUrl": "",
    "activeWarningAlertCount": "0",
    "activeErrorAlertCount": "0",
    "clusterId": "",
    "clusterUsable": false,
    "usePublicLoadBalancer": false,
    "usePrivateCluster": false,
    "creatingK8sNodeCount": 0,
    "terminatingK8sNodeCount": 0
  }
}

Once you have enabled your DataFlow environment, you are ready to deploy your first flow definition from the Catalog. For instructions, see Deploying a flow definition.

For more information on managing and monitoring DataFlow, see Managing DataFlow in an environment.