Enabling DataFlow for an environment

Enabling DataFlow for an environment is your first step in getting started with Cloudera DataFlow. To do this, ensure that you have met the prerequisites, and then launch the Enable Environment window to walk you through the process.

Before you begin

Steps

  1. From the CDP Public Cloud home page, click Cloudera DataFlow, then click Environments.
  2. Find the environment you want to enable, and click Enable to launch the Environments / Enable DataFlow window.

    If the Enable button is greyed out, hover over the Not Enabled icon for more details about the problem.

  3. Configure DataFlow Capacity.

    This defines the Kubernetes cluster minimum and maximum size and specifies the size of the Kubernetes cluster. Your DataFlow cluster automatically scales between the minimum and maximum cluster size that you specify here.

  4. Configure Networking.
    1. Specify DataFlow Subnet Use.

      Select Available Subnets. If you do not make a subnet selection, DataFlow considers all available subnets that have been registered with the environment for worker placement. Worker nodes are only placed in public subnets if no private subnets are available.

    2. Specify whether to use a Public Endpoint.

      Select this option when you want to allow users to connect to workload side UIs like the DataFlow Deployment Manager or the actual NiFi UI through the public Internet.

      • If checked when enabling an AWS environment, this option provisions an endpoint (load balancer) in a public subnet.

        If checked when enabling an Azure environment, this option provisions an endpoint (load balancer) in a public subnet and you cannot explicitly specify subnets. The Load Balancer Subnet Use option is not available.

      • If unchecked, CDP creates an endpoint in a private subnet, and you must set up access to the endpoint manually in your cloud account to allow user access to workload side UIs.

    3. Specify Load Balancer Subnet Use.

      Select Available Subnets. Explicitly specifying subnets overrides the automatic, tag-based subnet selection process and ensures load balancer provisioning in the specified subnets. If no subnets are specified, DataFlow provisions a load balancer according to how the subnets have been tagged. For more information, see VPC and subnets in the Related information section below.

    4. Specify Load Balancer Endpoint Access.

      Specify a set of IP address ranges that will be allowed to access the DataFlow load balancer endpoint. Providing no IP address ranges makes the load balancer endpoint open to all traffic.

  5. Configure the Kubernetes API Server Endpoint Access.

    CDP environments with Cluster Connectivity Manager (CCM) enabled support the ability to create a fully private cluster, which disallows all access to the Kubernetes API Server Endpoint.

    1. Specify whether to use a Private Cluster.
      • If Private Cluster is selected, IP based access is not applicable, and the option to specify a set of IP address ranges is not available.

      • If Private Cluster is not selected, you can specify a set of IP address ranges that will be allowed to access the Kubernetes API Server Endpoint. Providing no Classless Inter-Domain Routing (CIDR) makes the Kubernetes API Server Endpoint open to all traffic.

    In either case, any user who needs access to the Kubernetes API Server must be granted remote access to the underlying Kubernetes cluster. This can be configured after DataFlow has been enabled successfully.

  6. Configure Tags as keys and their values.

    Tags are added to DataFlow resources at the time of enablement. These tags are included in addition to those set by the CDP service.

  7. Click Enable. This may take up to 45 minutes.

Result

Your cluster status changes from Not Enabled to Enabling.

  • Hover over Enabling for environment enablement event messages to display.

  • Click the Alerts tab to see environment enablement event messages.

  • Click anywhere in your environment row to see your environment details.

Before you begin

  • You have installed the CDP CLI.
  • You have run cdp environments list-environments # to obtain the environment-crn.

Steps

  1. To enable DataFlow for an environment, enter:
    
    cdp df enable-service 
    --environment-crn [***ENVIRONMENT_CRN***] 
    --min-k8s-node-count [***MIN_K8S_NODE_COUNT***]
    --max-k8s-node-count [***MAX_K8S_NODE_COUNT***] 
    [--use-public-load-balancer] [--no-use-public-load-balancer] 
    [--private-cluster] [--no-private-cluster] 
    [--kube-api-authorized-ip-ranges [[***KUBE_API_AUTHORIZED_IP_RANGES***[ [[***KUBE_API_AUTHORIZED_IP_RANGES***[ ...]]]
    [--tags [***TAGS***]]
    [--load-balancer-authorized-ip-ranges [[***LOAD_BALANCER_AUTHORIZED_IP_RANGES***] [[***LOAD_BALANCER_AUTHORIZED_IP_RANGES***] ...]]]
    [--cluster-subnets [[***CLUSTER_SUBNETS***] [[***CLUSTER_SUBNETS***] ...]]]
    [--load-balancer-subnets [[***LOAD_BALANCER_SUBNETS***] [[***LOAD_BALANCER_SUBNETS***] ...]]]
    

    Where:

    • --environment-crn specifies the environment-crn you obtained while completing the pre-requisites.
    • --min-k8s-node-count and --max-k8s-node-count specify your DataFlow capacity. Your DataFlow cluster automatically scales between the minimum and maximum cluster size that you specify here.
    • [--use-public-load-balancer] [--no-use-public-load-balancer]
    • [--private-cluster] [--no-private-cluster]
    • --kube-api-authorized-ip-ranges specifies a set of IP address ranges that will be allowed to access the Kubernetes API Server Endpoint. Providing no IP address ranges makes the Kubernetes API Server Endpoint open to all traffic.
    • --tags specifies any tags you want to add to DataFlow resources at the time of enablement. These tags are in addition to those set by the CDP service.
    • --load-balancer-authorized-ip-ranges
    • --cluster-subnets
    • --load-balancer-subnets

Example

Successfully enabling DataFlow for an AWS environment results in output similar to:

{
  "service": {
    "kubeApiAuthorizedIpRanges": [],
    "validActions": [
      "ENABLE"
    ],
    "loadBalancerAuthorizedIpRanges": [],
    "clusterSubnets": [],
    "loadBalancerSubnets": [],
    "tags": {},
    "crn": "crn:cdp:df:us-west-1:CLOUDERA:service:96827a6b-bd8b-42fb-85ad-bdb4fc7ca43d",
    "environmentCrn": "crn:cdp:environments:us-west-1:CLOUDERA:environment:dev-east-local-k8s",
    "name": "dev-east-local-k8s",
    "cloudPlatform": "LOCAL_K8S",
    "region": "local-k8s",
    "deploymentCount": 0,
    "minK8sNodeCount": 3,
    "maxK8sNodeCount": 4,
    "status": {
      "state": "ENABLING",
      "message": "Enabling DataFlow crn:cdp:environments:us-west-1:CLOUDERA:environment:dev-east-local-k8s",
      "detailedState": "NEW"
    },
    "runningK8sNodeCount": 0,
    "instanceType": "m5.xlarge",
    "dfLocalUrl": "",
    "activeWarningAlertCount": "0",
    "activeErrorAlertCount": "0",
    "clusterId": "",
    "clusterUsable": false,
    "usePublicLoadBalancer": false,
    "usePrivateCluster": false,
    "creatingK8sNodeCount": 0,
    "terminatingK8sNodeCount": 0
  }
}

After you finish

Once you have enabled your DataFlow environment, you are ready to deploy your first flow definition from the Catalog. For instructions, see Deploying a flow definition.

For more information on managing and monitoring Dataflow, see Managing DataFlow in an environment.