Cloudera DataFlow Overview

Learn about key Cloudera DataFlow concepts.

Cloudera DataFlow (CDF) is a CDP Public Cloud service that enables self-serve deployments of Apache NiFi data flows from a central catalog to auto-scaling Kubernetes clusters managed by CDP. Flow deployments can be monitored from a central dashboard with the ability to define KPIs to keep track of critical data flow metrics.

CDF eliminates the operational overhead that is typically associated with running Apache NiFi clusters and allows users to fully focus on developing data flows and ensuring they meet business SLAs.

Key Concepts

Flow definition
A flow definition represents data flow logic that was developed in Apache NiFi and exported by using the Download Flow Definition action on a NiFi process group or the root canvas. Flow definitions typically use parameters to make them portable between environments, such as development and production NiFi environments.
To run an existing NiFi data flow in CDF, export it as a flow definition and upload it to the CDF catalog.
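Before uploading a flow definition, it can help to see which parameters it expects. The following is a minimal sketch, not part of CDF: it assumes the exported flow definition is a JSON file and relies only on NiFi's `#{parameter.name}` reference syntax. The function name `find_parameter_refs` and the trimmed sample document (including its field names and parameter names) are hypothetical, and escaped references are not handled.

```python
import json
import re

# NiFi property values reference parameters with the #{name} syntax.
PARAM_REF = re.compile(r"#\{([^}]+)\}")


def find_parameter_refs(node):
    """Recursively collect parameter names referenced in a parsed JSON document."""
    found = set()
    if isinstance(node, str):
        found.update(PARAM_REF.findall(node))
    elif isinstance(node, dict):
        for value in node.values():
            found |= find_parameter_refs(value)
    elif isinstance(node, list):
        for item in node:
            found |= find_parameter_refs(item)
    return found


# Hypothetical, heavily trimmed stand-in for an exported flow definition.
sample_flow = json.loads("""
{
  "flowContents": {
    "name": "kafka-to-s3",
    "processors": [
      {"name": "ConsumeKafka",
       "properties": {"bootstrap.servers": "#{kafka.brokers}",
                      "topic": "#{source.topic}"}},
      {"name": "PutS3Object",
       "properties": {"Bucket": "#{target.bucket}", "Region": "us-east-1"}}
    ]
  }
}
""")

print(sorted(find_parameter_refs(sample_flow)))
# → ['kafka.brokers', 'source.topic', 'target.bucket']
```

Because the walk does not depend on any particular field layout, it should work on a real export loaded with `json.load`; the resulting names are the values you would expect to supply for each environment.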
Catalog
The CDF Catalog is one of three main pages on the CDF UI. The catalog is where your flow definitions are stored and where you manage the flow definition lifecycle, from import through versioning to deletion. The catalog is also where users initiate new deployments.
Flow deployment
When you initiate the flow deployment process from the CDF catalog, the deployment wizard helps you turn a flow definition into a flow deployment that processes data. Using the deployment wizard, you supply configuration parameters, auto-scaling settings and KPI definitions for your flow deployment.
For each flow deployment, CDF creates a dedicated, auto-scaling NiFi cluster on the shared Kubernetes resources in an environment. Flow deployments can therefore scale independently, which allows users to isolate them from each other and assign resources to each deployment as needed.
You can assign KPIs to flow deployments, monitor them in the CDF Dashboard and manage their lifecycle through the Deployment Manager.
Deployment manager
The Deployment Manager allows you to access the NiFi canvas of your flow deployments as well as terminate them. Click the Manage Deployment link in the deployment details pane to get to the Deployment Manager.
KPIs
KPIs (Key Performance Indicators) are a monitoring tool in CDF that provides a real-time view into your data flow performance.
You can use KPIs to monitor critical parts of your NiFi deployments on the central monitoring dashboard, without drilling into the NiFi data flow to find the relevant metric. You can also create alerts for your KPIs in Cloudera DataFlow.
Apache NiFi exposes many metrics about the system, such as memory usage, CPU usage and data flow statistics. A KPI in Cloudera DataFlow is a representation of one of those metrics for a NiFi component.
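To make the relationship between a metric, a component and an alert concrete, here is a minimal sketch of a KPI alert as a threshold check. It is an illustration only and not the CDF data model: the class `KpiAlert`, its fields, and the metric and component names in the usage lines are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class KpiAlert:
    """Hypothetical KPI alert: one metric on one NiFi component, with optional bounds."""
    metric: str
    component: str
    lower: Optional[float] = None  # alert if the value drops below this
    upper: Optional[float] = None  # alert if the value rises above this

    def breached(self, value: float) -> bool:
        """Return True when the observed value falls outside the configured bounds."""
        if self.lower is not None and value < self.lower:
            return True
        if self.upper is not None and value > self.upper:
            return True
        return False


# Alert when a (hypothetical) source processor receives less than 1 KiB per interval.
alert = KpiAlert(metric="Bytes Received", component="ConsumeKafka", lower=1024.0)
print(alert.breached(200.0))   # → True
print(alert.breached(4096.0))  # → False
```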
Dashboard
The Dashboard is the central monitoring component within CDF, showing all flow deployments across environments at a glance. For each flow deployment, you can open the deployment details pane, which shows the KPIs you have defined, system metrics, and system events and alerts.
Environments
CDF works in the context of CDP environments. You can enable DataFlow for any AWS environment you have registered with CDP. The enablement process creates the Kubernetes infrastructure required by CDF, and each environment maps to one Kubernetes cluster.
Once DataFlow has been enabled for an environment, you can start deploying flow definitions to it.
Flow isolation
CDF is the first cloud service allowing NiFi users to easily isolate data flows from each other and guarantee a set of resources to each one without requiring administrators to create additional NiFi clusters.
Flow isolation describes the ability to treat NiFi process groups, which typically run on a shared cluster with shared resources, as independent, deployable artifacts that can be exported from NiFi as flow definitions.
Flow isolation is useful when:
  • You want to guarantee a set of resources for a specific data flow.
  • You want to isolate failure domains.

Auto-scaling
CDF is the first cloud service providing auto-scaling capabilities for Apache NiFi data flows. Flow deployments scale up and down based on CPU utilization, within the boundaries that are set when completing the deployment wizard. CDF scales flow deployments by adding or removing NiFi pods on the Kubernetes cluster as needed, and by scaling the Kubernetes cluster itself up or down within the boundaries specified during DataFlow enablement.
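The idea of CPU-based scaling within fixed boundaries can be sketched as follows. This is an illustration under stated assumptions, not CDF's actual algorithm: it borrows the proportional rule used by the Kubernetes Horizontal Pod Autoscaler (desired = ceil(current × observed / target)) and clamps the result to the configured minimum and maximum. The function name, its parameters and the target utilization values are hypothetical.

```python
import math


def desired_nodes(current_nodes, cpu_percent, target_percent, min_nodes, max_nodes):
    """Proportional scaling rule (HPA-style), clamped to [min_nodes, max_nodes].

    Assumption: this mirrors the Kubernetes HPA formula for illustration;
    the source does not state how CDF computes its scaling decisions.
    """
    if current_nodes < 1 or target_percent <= 0:
        raise ValueError("current_nodes must be >= 1 and target_percent > 0")
    if min_nodes > max_nodes:
        raise ValueError("min_nodes must not exceed max_nodes")
    raw = math.ceil(current_nodes * cpu_percent / target_percent)
    return max(min_nodes, min(max_nodes, raw))


# 3 nodes at 90% CPU against a 60% target: scale up to 5.
print(desired_nodes(3, 90, 60, min_nodes=1, max_nodes=5))  # → 5
# 3 nodes at 20% CPU: the raw result is 1, but the lower boundary keeps 2.
print(desired_nodes(3, 20, 60, min_nodes=2, max_nodes=5))  # → 2
# 4 nodes at 95% CPU against a 50% target: the raw result is 8, capped at 6.
print(desired_nodes(4, 95, 50, min_nodes=1, max_nodes=6))  # → 6
```

The clamp is what the boundaries from the deployment wizard correspond to in this sketch: load alone never pushes a deployment below its minimum or above its maximum node count.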