Cloudera DataFlow key concepts

Learn about the key concepts and terms used in Cloudera DataFlow.

Catalog

The Catalog is where your flow definitions are stored and where you manage the Cloudera DataFlow flow definition lifecycle, from import through versioning to deletion. It is also where you initiate new deployments.

Deployments

The Deployments view is the central monitoring component of Cloudera DataFlow, showing all flow deployments across environments at a glance. For each flow deployment, you can open the Deployment Details pane, which shows the KPIs you have defined, system metrics, and system events and alerts.

Deployment Manager

The Deployment Manager allows you to review and modify flow deployment parameters, size and scaling settings, and KPI and alert definitions. It also allows you to initiate NiFi version upgrades, access the NiFi canvas of a flow deployment, and terminate the deployment. To access the Deployment Manager, click Actions > Manage Deployment in the Deployment Details pane.

Environment

Cloudera DataFlow works in the context of Cloudera environments. You can enable the Cloudera DataFlow service for any supported environment you have registered with Cloudera. The enablement process creates the Kubernetes infrastructure required by Cloudera DataFlow. Each environment maps to one Kubernetes cluster.

Once Cloudera DataFlow has been enabled for an environment, you can start deploying flow definitions to it.

Flow definition

A flow definition represents data flow logic that was either developed in the Cloudera DataFlow Flow Designer and published to the Catalog, or developed in Apache NiFi and exported by using the Download Flow Definition action on a NiFi process group or the root canvas. Flow definitions typically use parameterization to make flows portable between, for example, development and production NiFi environments.
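As an illustration of why parameterization makes a flow portable, the following minimal Python sketch keeps the flow logic fixed and swaps only the parameter values per environment. It is not Cloudera DataFlow or NiFi code: the property names, parameter names, and values are hypothetical, and Python's string.Template stands in for NiFi's own parameter reference syntax.

```python
from string import Template

# Hypothetical flow logic: two properties that reference parameters.
# The parameter names (kafka_brokers, target_bucket) are illustrative only.
FLOW_PROPERTY_TEMPLATES = {
    "source": Template("kafka://${kafka_brokers}/orders"),
    "sink": Template("s3a://${target_bucket}/orders/"),
}

# One set of parameter values per environment; the flow logic stays unchanged.
PARAMETERS = {
    "development": {"kafka_brokers": "dev-broker:9092", "target_bucket": "dev-landing"},
    "production": {"kafka_brokers": "prod-broker:9092", "target_bucket": "prod-landing"},
}


def resolve(environment):
    """Return the flow's properties with parameters filled in for one environment."""
    values = PARAMETERS[environment]
    return {name: tpl.substitute(values) for name, tpl in FLOW_PROPERTY_TEMPLATES.items()}


if __name__ == "__main__":
    for env in PARAMETERS:
        print(env, resolve(env))
```

The same templates resolve to different endpoints for development and production, which is the property that lets one flow definition be deployed to several environments.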

To run an existing NiFi data flow in Cloudera DataFlow, you have to export it as a flow definition and upload it to the Cloudera DataFlow Catalog.

Flow deployment

A flow deployment represents a NiFi cluster running on Kubernetes and executing a specific flow definition. When you initiate the flow deployment process from the Catalog, a deployment wizard helps you turn a flow definition into a flow deployment. In the wizard, you specify the target environment and provide configuration parameters, auto-scaling settings, and KPI definitions for your flow deployment.

Function

A function is a flow that is uploaded into the Cloudera DataFlow Catalog and that can be run in serverless mode by serverless cloud provider services.

KPI

Apache NiFi exposes multiple metrics for monitoring the system, such as memory usage, CPU usage, and data flow statistics. Key Performance Indicators (KPIs) are representations of those metrics for a NiFi component in Cloudera DataFlow. They are a critical monitoring tool that gives you a real-time view into your data flow performance.
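One way to picture a KPI is as a metric scoped to a single NiFi component, optionally paired with an alert condition. The sketch below is a hypothetical model for illustration only. The class, field names, component name, and threshold are assumptions, not the Cloudera DataFlow data model or API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Kpi:
    """Hypothetical model of a KPI: one metric tracked for one NiFi component."""
    component: str                       # NiFi component the KPI is scoped to
    metric: str                          # name of the metric being tracked
    alert_above: Optional[float] = None  # assumed alert threshold; None means no alert

    def breached(self, value: float) -> bool:
        """Return True if an alert threshold is set and the value exceeds it."""
        return self.alert_above is not None and value > self.alert_above


if __name__ == "__main__":
    # Illustrative values only.
    queued = Kpi(component="PutS3Object", metric="flowfiles_queued", alert_above=1000)
    print(queued.breached(250))
    print(queued.breached(1500))
```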

NiFi node

In a flow deployment, a NiFi node is a pod provisioned in the underlying Kubernetes (K8s) cluster. It does not map directly to a virtual machine (VM) of the underlying K8s cluster.

You specify the allowed number of VMs in the K8s cluster when you enable Cloudera DataFlow for a Cloudera environment, by configuring the minimum and maximum number of K8s nodes and the auto-scaling settings.

You specify the allowed number of NiFi nodes during flow deployment. These nodes are then provisioned in the K8s cluster on one or several VMs, depending on resource allocation. If auto-scaling is configured for the K8s cluster, auto-scaling the NiFi nodes of a deployment may or may not trigger auto-scaling of the underlying K8s cluster, depending on the current resource allocation.
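The relationship between NiFi nodes (pods) and VMs can be sketched with some simple arithmetic. The example below assumes that CPU is the only constraint and that all VMs are the same size. Real Kubernetes scheduling also accounts for memory, other workloads, and placement rules, so treat the function names and numbers as hypothetical.

```python
import math


def vms_needed(nifi_nodes, cores_per_nifi_node, cores_per_vm):
    """VMs needed to host the NiFi pods, assuming CPU is the only constraint."""
    pods_per_vm = cores_per_vm // cores_per_nifi_node
    if pods_per_vm < 1:
        raise ValueError("a single NiFi node does not fit on one VM")
    return math.ceil(nifi_nodes / pods_per_vm)


def cluster_must_scale(current_vms, nifi_nodes, cores_per_nifi_node, cores_per_vm):
    """True if the requested NiFi nodes no longer fit on the current VMs."""
    return vms_needed(nifi_nodes, cores_per_nifi_node, cores_per_vm) > current_vms


if __name__ == "__main__":
    # Illustrative sizing: 16-core VMs and 4-core NiFi nodes, so 4 pods fit per VM.
    for nodes in (3, 4, 5):
        print(nodes, vms_needed(nodes, 4, 16), cluster_must_scale(1, nodes, 4, 16))
```

With these assumed sizes, scaling a deployment from three to four NiFi nodes still fits on one VM, while scaling to five needs a second VM. This is why NiFi node auto-scaling only sometimes triggers K8s cluster auto-scaling.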

Parameter group

A parameter group is a set of shared parameters that can be reused within the project it is currently assigned to. Using shared parameter groups facilitates flow development and flow deployment.

Project

A Project is a container for a set of Cloudera DataFlow resources that restricts visibility of resources associated with it.

ReadyFlow

A ReadyFlow is a predefined, out-of-the-box data flow which can be immediately deployed by providing a small set of required parameters.

ReadyFlow Gallery

The ReadyFlow Gallery is where you find all available ReadyFlows. To use a ReadyFlow, you need to add it from the ReadyFlow Gallery to the Catalog and then use it to create a flow deployment.

Resource

Flow deployments, flow drafts, parameter groups, inbound connections, custom NAR configs, and custom Python configs are collectively called resources in Cloudera DataFlow. You can view and manage them from the Resources view.

Workspace

The Workspace view displays all resources within an environment, making it easier to switch between them and manage them.