Comparison of deployments and functions in Cloudera DataFlow

This topic explains the differences between deployments and functions in the context of Cloudera DataFlow for the Public Cloud (CDF-PC). It helps you understand when to use one or the other, from both a feature and a cost perspective.

CDF-PC allows you to run NiFi flows on Kubernetes clusters, efficiently and at scale. For a specific set of use cases, however, it is more beneficial to go one step further and use DataFlow Functions: a CDF-PC extension that runs NiFi flows as functions for event-driven use cases, with Apache NiFi providing a no-code UI for building and running those functions efficiently.

DataFlow Deployments provides a cloud-native runtime for running your Apache NiFi flows on auto-scaling Kubernetes clusters, together with centralized monitoring and alerting that improves the Software Development Life Cycle (SDLC) for developers. To run a DataFlow deployment, a Kubernetes cluster (or CDF environment) must be provisioned, and at least one virtual machine must always be up and running. Additional resources are provisioned depending on your requirements and on the resource consumption of the running flows. The total cost of ownership (TCO) consists of the cloud provider's and Cloudera's costs. The cloud provider's costs cover running the Kubernetes cluster through its native service and depend on the number of virtual machines in use. Cloudera's costs follow Cloudera Compute Unit (CCU) pricing for the running flows and are metered for the time during which the flows are running.

DataFlow Functions provides a cloud-native runtime for running your Apache NiFi flows as functions on the serverless compute services of three cloud providers (AWS Lambda, Azure Functions, and Google Cloud Functions). It is particularly powerful when the flow does not require NiFi resources to be up and running at all times. Typical use cases include event-driven object store processing, microservices that power serverless web applications, IoT data processing, asynchronous API gateway request processing, batch file processing, and job automation with cron or timer scheduling. For these use cases, the NiFi flows need to be treated like jobs with a distinct start and end: the start is a trigger event such as a file landing in an object store, a cron event firing, or a gateway endpoint being invoked. The cloud provider provisions resources only for the time the flow needs to process the trigger event. The TCO consists of the cloud provider's costs, which depend on the amount and duration (in milliseconds) of the resources provisioned for the function, and of Cloudera's costs, which are based on the number of function invocations plus the execution time when the processing of a single event takes more than one second.
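
To make the function model concrete, the following is a minimal sketch of registering a DataFlow Functions binary as an AWS Lambda function with boto3. The function name, role ARN, artifact location, handler class, and environment variable names are illustrative assumptions; take the exact values from the DataFlow Functions documentation and the CDF-PC Catalog.

```python
import boto3

# Illustrative sketch: register a DataFlow Functions binary as an AWS Lambda
# function. All names, ARNs, and variable names below are assumptions for
# illustration; the real values come from the CDF-PC docs and Catalog.
lambda_client = boto3.client("lambda")

response = lambda_client.create_function(
    FunctionName="my-dataflow-function",  # hypothetical name
    Runtime="java11",                     # Stateless NiFi runs on the JVM
    Role="arn:aws:iam::123456789012:role/my-dff-role",  # hypothetical role
    Handler="com.cloudera.naaf.aws.lambda.StatelessNiFiFunctionHandler::handleRequest",
    Code={"S3Bucket": "my-artifacts", "S3Key": "naaf-aws-lambda.zip"},
    MemorySize=2048,  # MB; bounded by the provider's per-instance limit
    Timeout=900,      # seconds; Lambda caps execution at 15 minutes
    Environment={
        "Variables": {
            "FLOW_CRN": "crn:cdp:df:...",  # CRN of the flow in the DataFlow Catalog
            "DF_ACCESS_KEY": "...",        # CDP machine user credentials
            "DF_PRIVATE_KEY": "...",
        }
    },
)
print(response["FunctionArn"])
```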

| Feature | DataFlow Deployments | DataFlow Functions |
| --- | --- | --- |
| Supported cloud providers | AWS, Azure | AWS, Azure, Google Cloud |
| Runtime | AWS EKS, Azure AKS | AWS Lambda, Azure Functions, Google Cloud Functions |
| Cloudera pricing | CCU based: per minute for resources used by the running flows | Per invocation |
| Supports NiFi clustering | Yes | No |
| Auto-scaling | Yes: the NiFi cluster scales up and down based on CPU consumption, with a minimum of one pod per running flow | Yes: based on how many concurrent events need to be processed; multiple instances of the function can run at the same time |
| Supports all NiFi components | Yes | Yes |
| Resource limitations | Yes: a given NiFi pod can be provisioned with a maximum of 12 vCores and 24 GB of memory | Yes: a function instance cannot be provisioned with more than a given amount of memory and CPU (depends on the cloud provider) |
| Duration limitations | No: a given flow can run indefinitely | Yes: a function execution cannot exceed about 15 minutes (depending on the cloud provider) |
| Access to the NiFi UI | Yes: the NiFi UI can be accessed for a running flow | No: DataFlow Functions is powered by Stateless NiFi, which does not provide a UI |
| Serverless | No: a minimum amount of resources is always up and running (for the Kubernetes cluster and for each running flow) | Yes: no infrastructure to manage and no running resources when there are no events to process |
| In-memory processing | No: data going through NiFi is written to attached volumes | Yes: unless specified otherwise, processing is in-memory to improve performance |
| Multiple sources / multiple destinations | Yes: there is no limit on flow definition complexity | No: DataFlow Functions is designed for simple event processing where a given event is processed and sent to one destination |

From a purely technical point of view, any flow that can run as a DataFlow function can also run as a DataFlow deployment. Choosing one or the other depends on various considerations:

Trigger
Using DataFlow Functions assumes that the use case has a trigger mechanism that can be implemented at the cloud provider level. Most use cases are covered by the existing triggers, but some are not. For example, a flow that listens for events over TCP or UDP does not have a corresponding trigger available.
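As an example of a cloud-provider-level trigger, the sketch below wires an S3 bucket to a Lambda-hosted DataFlow Function so that every object creation invokes the flow. The bucket name and ARNs are hypothetical.

```python
import boto3

lambda_client = boto3.client("lambda")
s3 = boto3.client("s3")

# S3 must be allowed to invoke the function before the notification is created.
lambda_client.add_permission(
    FunctionName="my-dataflow-function",  # hypothetical
    StatementId="allow-s3-invoke",
    Action="lambda:InvokeFunction",
    Principal="s3.amazonaws.com",
    SourceArn="arn:aws:s3:::my-landing-bucket",
)

# Fire the function whenever an object is created: the classic
# event-driven object store trigger described above.
s3.put_bucket_notification_configuration(
    Bucket="my-landing-bucket",  # hypothetical
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:my-dataflow-function",
                "Events": ["s3:ObjectCreated:*"],
            }
        ]
    },
)
```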
Cost
There are many parameters to take into account for an exhaustive cost analysis between the two options (the flow design, the required resources, the event execution time, and so on), but as a rule of thumb, DataFlow Functions is more cost efficient up to about one million events processed per month.
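The break-even point depends on your own workload, but a back-of-envelope calculation like the following can frame the comparison. All rates and the always-on baseline below are made-up placeholder numbers, not actual AWS or Cloudera pricing.

```python
# Back-of-envelope sketch of the serverless side of the cost comparison.
# All rates are illustrative assumptions, not actual cloud or Cloudera pricing.
PRICE_PER_INVOCATION = 0.0000002    # assumed cloud cost per invocation (USD)
PRICE_PER_GB_SECOND = 0.0000166667  # assumed cloud cost per GB-second (USD)
ALWAYS_ON_PER_MONTH = 450.0         # assumed cost of a minimal always-on cluster

def monthly_function_cost(events_per_month, memory_gb, avg_seconds_per_event):
    """Cloud-side cost of processing a month of events with a function."""
    compute = events_per_month * memory_gb * avg_seconds_per_event * PRICE_PER_GB_SECOND
    invocations = events_per_month * PRICE_PER_INVOCATION
    return compute + invocations

# One million events per month, 2 GB of memory, 500 ms per event:
cost = monthly_function_cost(1_000_000, 2.0, 0.5)
print(f"functions: ${cost:.2f}/month vs always-on: ${ALWAYS_ON_PER_MONTH:.2f}/month")
```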
SLA
Depending on how the function is configured, a cold start may happen. In a serverless architecture, a cold start refers to the time required to provision the resources that will run the function. With DataFlow Functions, it is the sum of the time required to provision the container and the time required for Stateless NiFi to start and load the components needed to execute the NiFi flow. It can range from a few seconds to a minute, depending on the function's configuration. A cold start only happens when the function has not been triggered for some time (how long depends on the cloud provider). It is also possible to eliminate cold starts completely, at additional cost on the cloud provider's side.
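On AWS, for instance, cold starts can be eliminated by keeping instances warm with provisioned concurrency. A minimal sketch with boto3 follows; the function name and version are hypothetical.

```python
import boto3

lambda_client = boto3.client("lambda")

# Illustrative sketch: keep two instances warm so the function never
# cold-starts. This is the "remove the cold start at additional cost"
# option described above; function name and version are hypothetical.
lambda_client.put_provisioned_concurrency_config(
    FunctionName="my-dataflow-function",
    Qualifier="1",  # a published version or alias
    ProvisionedConcurrentExecutions=2,
)
```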
Serverless
DataFlow Functions relies on the cloud provider to provision resources when events need to be processed. There are no infrastructure considerations for upgrades, patches, maintenance, monitoring, and so on. DataFlow Functions delegates scalability to the cloud provider and can scale virtually infinitely to handle very bursty use cases.

Use cases

A few example use cases for DataFlow Functions:

  • Serverless data processing pipelines: Develop and run your data processing pipelines when files are created or updated in any of the cloud object stores (for example: when a photo is uploaded to object storage, a data flow is triggered that runs image-resizing code and delivers the resized images to different locations for consumption by web, mobile, and tablet clients).

  • Serverless workflows/orchestration: Chain different low-code functions to build complex workflows (for example: automate the handling of support tickets in a call center).

  • Serverless scheduled tasks: Develop and run scheduled tasks without any code at pre-defined time intervals (for example: offload an external on-premises database into the cloud every morning at 4:00 a.m.); see the scheduling sketch after this list.

  • Serverless IoT event processing: Collect, process, and move data from IoT devices with serverless IoT processing endpoints (for example: telemetry data from oil rig sensors that needs to be filtered, enriched, and routed to different services is batched every few hours and sent to a cloud storage staging area).

  • Serverless microservices: Build and deploy serverless independent modules that power your application's microservices architecture (for example: event-driven functions for easy communication between thousands of decoupled services that power a ride-sharing application).

  • Serverless web APIs: Easily build endpoints for your web applications with HTTP APIs, without any code, using DFF and any of the cloud providers' function triggers (for example: build high-performance, scalable web applications across multiple data centers).

  • Serverless customized triggers: With the DFF State feature, build flows that create customized triggers allowing access to on-premises or external services (for example: near-real-time offloading of files from a remote SFTP server).

  • Serverless stream processing: Easily process messages in real time as they arrive in your messaging queue service (Kinesis, Kafka, Pub/Sub, Event Hubs, and so on).
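
For the scheduled-tasks use case above, a timer trigger can be as simple as a cron rule pointing at the function. The sketch below uses AWS EventBridge via boto3; the rule name and ARNs are hypothetical.

```python
import boto3

events = boto3.client("events")
lambda_client = boto3.client("lambda")

# Illustrative sketch: trigger the function every day at 4:00 a.m. UTC.
# Rule name and ARNs are hypothetical.
rule = events.put_rule(
    Name="daily-offload-4am",
    ScheduleExpression="cron(0 4 * * ? *)",  # 04:00 UTC every day
)
events.put_targets(
    Rule="daily-offload-4am",
    Targets=[{
        "Id": "dff",
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:my-dataflow-function",
    }],
)
# EventBridge also needs permission to invoke the function.
lambda_client.add_permission(
    FunctionName="my-dataflow-function",
    StatementId="allow-eventbridge-invoke",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn=rule["RuleArn"],
)
```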

For further function case studies, see AWS Lambda, Azure Functions, and Google Cloud Functions.

You should consider DataFlow Deployments over DataFlow Functions when:

  • A single event (file, request, message, and so on) needs to go to multiple destinations. DataFlow Functions is better suited for use cases with a single source and a single destination.

  • More than one million invocations or processing seconds per month are expected; in this case, DataFlow Deployments would likely be more cost-effective.

  • The use case relies on listener components (other than HTTP), such as ListenTCP, ListenUDP, and so on.

  • The data needs to be persisted across restarts (for example, the data source is not replayable).

  • Buffering or merging of multiple events before sending them to a destination is required; DataFlow Functions provides single-event processing.

  • The data to be processed is extremely large (multiple GB). While DataFlow Functions can also be used, adding file store or ephemeral storage incurs additional costs.

  • The use case cannot afford a cold start and must always offer very low latency. DataFlow Functions can be configured with always-running instances at additional cost.

  • The workload would benefit from the NiFi clustering and auto-scaling capabilities.

  • The user would like to use the command and control features of the NiFi UI to stop and update parts of the running flow, and to monitor data processing there.

  • The user has a large set of use cases and running everything on a single Kubernetes cluster might provide a lower TCO.