Cloudera Data Flow key concepts
Learn about the key concepts and terms used in Cloudera Data Flow.
Catalog
The Catalog is where your flow definitions are stored and where you manage the Cloudera Data Flow flow definition lifecycle, from import through versioning to deletion. You also initiate new deployments from the Catalog.
Collection
A Collection is a logical container within the Cloudera Data Flow Catalog that groups flow definitions to enable both organization and fine-grained access control. Flow definitions assigned to a collection are only visible to users with the appropriate permissions for that collection. A flow definition can belong to only one collection at a time, and removing it from a collection makes it accessible to all users with sufficient Catalog permissions.
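The visibility rule described above can be sketched as a small model. This is only an illustration, not the Cloudera Data Flow API: the class names, the has_catalog_access flag, and the can_view helper are all hypothetical, and the sketch assumes that collection permission alone decides visibility for a flow definition that belongs to a collection.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class FlowDefinition:
    name: str
    # A flow definition belongs to at most one collection at a time.
    collection: Optional[str] = None


@dataclass
class User:
    name: str
    # Hypothetical flag standing in for "sufficient Catalog permissions".
    has_catalog_access: bool = True
    # Collections this user holds permissions for (hypothetical model).
    collections: Set[str] = field(default_factory=set)


def can_view(user: User, flow: FlowDefinition) -> bool:
    """Return True if the user can see the flow definition in the Catalog."""
    if flow.collection is not None:
        # Assigned to a collection: only users with permissions for it see it.
        return flow.collection in user.collections
    # Not in any collection: visible to everyone with Catalog permissions.
    return user.has_catalog_access


flow = FlowDefinition("ingest-orders", collection="finance")
alice = User("alice", collections={"finance"})
bob = User("bob")

print(can_view(alice, flow), can_view(bob, flow))
flow.collection = None  # removing the flow from its collection
print(can_view(bob, flow))
```

Modeling collection as a single optional value, rather than a list, mirrors the rule that a flow definition can belong to only one collection at a time.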
Deployments
The Deployments view is the central monitoring component within Cloudera Data Flow showing all flow deployments across environments at a glance. For each flow deployment, you can open the Deployment Details pane, which shows you the KPIs you have defined, system metrics, as well as system events and alerts.
Deployment Manager
The Deployment Manager allows you to review and modify flow deployment parameters, settings for size and scaling, and KPI and alert definitions. It also allows you to initiate NiFi version upgrades, access the NiFi canvas of your flow deployments, and terminate them. You access the Deployment Manager from the Deployment Details pane.
Environment
Cloudera Data Flow works in the context of Cloudera environments. You can enable the Cloudera Data Flow service for any supported environment you have registered with Cloudera. The enablement process creates the Kubernetes infrastructure required by Cloudera Data Flow, and each environment maps to one Kubernetes cluster.
Once Cloudera Data Flow has been enabled for an environment, you can start deploying flow definitions to it.
Flow definition
A flow definition represents data flow logic that was either developed in Cloudera Data Flow's Flow Designer and published to the Catalog, or developed in Apache NiFi and exported by using the Download Flow Definition action on a NiFi process group or the root canvas. Flow definitions typically use parameterization to make flows portable between, for example, development and production NiFi environments.
To run an existing NiFi data flow in Cloudera Data Flow, you have to export it as a flow definition and upload it to the Cloudera Data Flow Catalog.
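As an illustration of why parameterization makes a flow definition portable, the following sketch resolves the same set of property values against two different parameter sets. NiFi references a parameter inside a property value as #{parameter.name}; everything else here (the property names, the parameter names, the example hosts and paths, and the resolve_properties helper) is hypothetical and only mimics what NiFi does internally.

```python
import re

# NiFi references a parameter inside a property value as #{parameter.name}.
PARAM_REF = re.compile(r"#\{([^}]+)\}")


def resolve_properties(properties, parameters):
    """Replace every #{name} reference with its value from `parameters`."""

    def substitute(match):
        name = match.group(1)
        if name not in parameters:
            raise KeyError(f"missing parameter: {name}")
        return parameters[name]

    return {
        key: PARAM_REF.sub(substitute, value)
        for key, value in properties.items()
    }


# Hypothetical property values as they might appear in one flow definition.
flow_properties = {
    "Source Directory": "#{source.dir}",
    "Target URL": "https://#{target.host}/ingest",
}

# Hypothetical parameter values supplied per deployment.
dev_parameters = {"source.dir": "/tmp/dev-input", "target.host": "dev.example.com"}
prod_parameters = {"source.dir": "/data/input", "target.host": "prod.example.com"}

dev_config = resolve_properties(flow_properties, dev_parameters)
prod_config = resolve_properties(flow_properties, prod_parameters)
print(dev_config)
print(prod_config)
```

The flow logic stays identical in both cases; only the parameter values supplied at deployment time differ.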
Flow deployment
A flow deployment represents a NiFi cluster running on Kubernetes and executing a specific flow definition. When you initiate the flow deployment process from the Catalog, a deployment wizard helps you turn a flow definition into a flow deployment. In the wizard, you specify your environment and provide configuration parameters, auto-scaling settings, and KPI definitions for your flow deployment.
Function
A function is a flow that is uploaded into the Cloudera Data Flow Catalog and that can be run in serverless mode by serverless cloud provider services.
KPI
Apache NiFi has multiple metrics to monitor the different statistics of the system such as memory usage, CPU usage, data flow statistics, and so on. Key Performance Indicators (KPIs) are representations of those metrics for a NiFi component in Cloudera Data Flow. They provide a critical monitoring tool for a real-time view into your data flow performance.
NiFi node
In a flow deployment, a NiFi node is a pod provisioned in the underlying Kubernetes (K8s) cluster. It does not directly relate to a Virtual Machine (VM) of the underlying K8s cluster.
You specify the allowed number of VMs in the K8s cluster when you enable Cloudera Data Flow for a Cloudera environment, by configuring the minimum and maximum number of K8s nodes together with auto-scaling.
You specify the allowed number of NiFi nodes during flow deployment. These nodes are then provisioned in the K8s cluster on one or several VMs, depending on resource allocation. Auto-scaling the NiFi nodes of a deployment may or may not trigger auto-scaling of the underlying K8s cluster, depending on whether cluster auto-scaling is configured and on the current resource allocation.
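The relationship between NiFi nodes (pods) and VMs can be illustrated with a deliberately simplified capacity model. It is a sketch under stated assumptions, not how the Kubernetes scheduler or Cloudera Data Flow actually decides: it considers CPU only, assumes identical VMs that run nothing but NiFi pods, and uses hypothetical function and parameter names.

```python
import math


def vms_needed(nifi_nodes, pod_cpu, vm_cpu):
    """VMs required to host the pods, packing by CPU only (simplification)."""
    pods_per_vm = vm_cpu // pod_cpu
    if pods_per_vm < 1:
        raise ValueError("a single pod does not fit on one VM")
    return math.ceil(nifi_nodes / pods_per_vm)


def plan_scaling(current_vms, max_vms, target_nifi_nodes, pod_cpu, vm_cpu):
    """Classify what scaling to `target_nifi_nodes` means for the cluster."""
    needed = vms_needed(target_nifi_nodes, pod_cpu, vm_cpu)
    if needed <= current_vms:
        return "fits"           # pods land on existing VMs, no cluster scaling
    if needed <= max_vms:
        return "scale_cluster"  # cluster auto-scaling has to add VMs
    return "exceeds_max"        # the configured maximum number of VMs is too low


# Hypothetical sizes: 8-CPU VMs and 3-CPU NiFi pods, so two pods fit per VM.
print(plan_scaling(current_vms=2, max_vms=4, target_nifi_nodes=4, pod_cpu=3, vm_cpu=8))
print(plan_scaling(current_vms=2, max_vms=4, target_nifi_nodes=5, pod_cpu=3, vm_cpu=8))
```

With these example numbers, scaling to four NiFi nodes still fits on the two existing VMs, while a fifth NiFi node requires a third VM. This is why scaling a deployment only sometimes triggers scaling of the cluster.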
Parameter group
A parameter group is a set of shared parameters that can be reused within the project it is currently assigned to. Using shared parameter groups facilitates flow development and flow deployment.
Project
A Project is a container for a set of Cloudera Data Flow resources that restricts visibility of resources associated with it.
ReadyFlow
A ReadyFlow is a predefined, out-of-the-box data flow which can be immediately deployed by providing a small set of required parameters.
ReadyFlow Gallery
The ReadyFlow Gallery is where you find all available ReadyFlows. To use a ReadyFlow, you need to add it from the ReadyFlow Gallery to the Catalog and then use it to create a flow deployment.
Resource
Flow deployments, flow drafts, parameter groups, inbound connections, custom NAR configs, and custom Python configs are collectively called resources in Cloudera Data Flow. You can view and manage them from the Resources view.
Workspace
The Workspace view displays all resources within an environment, making it easier to switch between them and manage them.