Use case

The DataFlow Functions Quickstart presents a telemetry use case in which telemetry data from various sensors is batched into a file and sent to a landing zone in an Amazon S3 bucket. Get familiar with the key functional and non-functional requirements.

Each line in the telemetry batch file represents one sensor event with the following schema:
{
	"fields": [{"name": "eventTime", "type": "string"},
			   {"name": "eventTimeLong","type": "long","default": 0},
			   {"name": "eventSource","type": "string"},
			   {"name": "truckId","type": "int"},
			   {"name": "driverId","type": "int"},
			   {"name": "driverName","type": "string"},
			   {"name": "routeId","type": "int"},
			   {"name": "route","type": "string"},
			   {"name": "eventType", "type": "string","default": "None"},
			   {"name": "latitude", "type": "double","default": 0},
			   {"name": "longitude","type": "double","default": 0},
			   {"name": "correlationId","type": "long","default": 0},
			   {"name": "speed","type": "int","default": 0}
			]
}
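
For illustration, a single event line might look like the following, parsed here with Python. The field values are hypothetical and are not taken from the Quickstart data set; only the field names come from the schema above.

import json

# Hypothetical example of one line from a telemetry batch file; the field
# names follow the schema above, but the values are made up for illustration.
sample_line = (
    '{"eventTime": "2021-10-02 10:15:00.000", "eventTimeLong": 1633169700000, '
    '"eventSource": "truck_geo_event", "truckId": 41, "driverId": 17, '
    '"driverName": "J. Doe", "routeId": 160405074, "route": "Springfield to Omaha", '
    '"eventType": "Normal", "latitude": 38.31, "longitude": -91.07, '
    '"correlationId": 1000, "speed": 47}'
)

event = json.loads(sample_line)
print(event["eventSource"], event["speed"])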

The telemetry batch files are sent periodically throughout the day. When a telemetry batch file lands in S3, the events in the file need to be routed, filtered, enriched, transformed into Parquet format, and then stored back in S3. Because the files arrive periodically and do not require constantly running resources, a true pay-for-compute model is critical: once a file is processed, the function and all corresponding resources should be shut down, and usage should be charged only for the time the function was executing. This makes the use case a cost-effective solution, with a trade-off against high throughput.

Key functional requirements

Routing
Events in the telemetry file need to be sent to different S3 locations based on the “eventSource” value.
Filtering
Certain events need to be filtered based on various rules (for example, speed > x).
Enrichment
Geo events need to be enriched with a geolocation based on their latitude/longitude values using a lookup service.
Format Conversion
The events need to be transformed from JSON to Parquet based on the provided schema.
Delivery
The filtered and enriched data, in Parquet format, needs to be delivered to different S3 locations. A minimal sketch of these processing steps follows this list.
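
The Quickstart implements these steps as a NiFi flow, but to make the requirements concrete, the following is a minimal Python sketch of the same pipeline under stated assumptions: the speed threshold, the output bucket and key layout, and the geo enrichment stub are hypothetical placeholders, and the pyarrow and boto3 libraries are assumed to be available. It is not the Quickstart implementation.

import json

import boto3
import pyarrow as pa
import pyarrow.parquet as pq

s3 = boto3.client("s3")

SPEED_THRESHOLD = 80                   # hypothetical filter rule (speed > x)
OUTPUT_BUCKET = "telemetry-processed"  # hypothetical destination bucket

def enrich_geo(event):
    # A real implementation would call a geo lookup service with the
    # latitude/longitude values; this stub only records the coordinates
    # it would look up.
    event["geoKey"] = f'{event["latitude"]},{event["longitude"]}'
    return event

def process_batch(lines):
    # Filtering, enrichment, and routing: group the surviving events by
    # "eventSource" so each group can be delivered to its own S3 location.
    routed = {}
    for line in lines:
        event = json.loads(line)
        # Filtering: keep only events matching the speed rule (speed > x here;
        # the actual rules in the use case may differ per event type).
        if event.get("speed", 0) <= SPEED_THRESHOLD:
            continue
        # Enrichment: add geo information to geo events.
        if "geo" in event.get("eventSource", ""):
            event = enrich_geo(event)
        routed.setdefault(event["eventSource"], []).append(event)
    return routed

def deliver(routed):
    # Format conversion and delivery: write each group of events as a
    # Parquet object under a per-eventSource prefix.
    for event_source, events in routed.items():
        table = pa.Table.from_pylist(events)
        sink = pa.BufferOutputStream()
        pq.write_table(table, sink)
        s3.put_object(
            Bucket=OUTPUT_BUCKET,
            Key=f"{event_source}/events.parquet",  # hypothetical key layout
            Body=sink.getvalue().to_pybytes(),
        )

Grouping by eventSource before writing keeps each Parquet object schema-consistent and maps directly onto the routing and delivery requirements.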

Key non-functional requirements

Agile low-code development
Provide a low-code development environment for developing the processing logic, with strong SDLC capabilities including developing and testing locally with test data and promoting to production in the cloud.
Serverless
The telemetry processing code needs to be run without provisioning or managing infrastructure.
Trigger-based processing
The processing code and related resources should only be spun up when a new file lands, and once processing completes, all resources should be shut down. No long-running resources are required. A sketch of such an S3-triggered handler follows this list.
Pay only for compute time
Pay only for the compute time used by the processing code; no infrastructure should need to be provisioned up front for peak capacity.
Scale
Support any scale from processing a few files a day to hundreds of files per second.
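
The Quickstart wires this trigger up for you through DataFlow Functions, but the trigger-based model itself can be sketched as a plain S3-triggered AWS Lambda handler. The bucket and key extraction below follows the standard S3 event notification structure; process_object is a hypothetical stand-in for the processing logic sketched earlier.

import boto3

s3 = boto3.client("s3")

def handler(event, context):
    # Invoked by AWS Lambda only when an S3 ObjectCreated notification
    # arrives; nothing runs between invocations, and charges accrue only
    # for the time this function executes.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        process_object(body.decode("utf-8").splitlines())

def process_object(lines):
    # Hypothetical stand-in for the filtering, routing, enrichment, and
    # format-conversion steps described in the functional requirements.
    pass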

Implementation

The DataFlow Functions Quickstart shows how you can implement these functional and non-functional requirements using Apache NiFi and the Cloudera DataFlow service, and how you can run your functions in serverless mode using the AWS Lambda service. See the following diagram for the overall process: