ReadyFlow: Kafka to S3 Avro

Learn about the Kafka to S3 Avro ReadyFlow.

This ReadyFlow consumes JSON, CSV, or Avro data from a source Kafka topic, merges the events into Avro files, and writes the data to S3. The flow writes out a file once it reaches 100 MB in size or once five minutes have passed, whichever comes first; no single file exceeds 1 GB. You can specify the topic you want to read from as well as the target S3 bucket and path. Failed S3 write operations are retried automatically to handle transient issues. Define a KPI on the failure_WriteToS3 connection to monitor failed write operations.
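
To make the roll policy concrete, here is a minimal Python sketch of the size- and age-based cut logic described above. It is an illustration only, not the flow's actual implementation (the ReadyFlow performs the merge with NiFi processors); the class and constant names are invented for this example.

    import time

    MIN_FLUSH_SIZE = 100 * 1024 * 1024   # cut a file once it reaches 100 MB
    MAX_FILE_SIZE = 1024 * 1024 * 1024   # hard 1 GB cap on any single file
    MAX_AGE_SECONDS = 5 * 60             # cut after five minutes regardless of size

    class RollingFile:
        """Accumulates records and decides when the next output file is cut."""

        def __init__(self):
            self.buffer: list[bytes] = []
            self.size = 0
            self.opened_at = time.monotonic()

        def add(self, record: bytes) -> list[bytes] | None:
            """Buffer one record; return a finished batch when a file is cut."""
            # Enforce the 1 GB cap: if the incoming record would push the
            # current file past it, cut that file first and start a fresh one.
            if self.buffer and self.size + len(record) > MAX_FILE_SIZE:
                finished = self._cut()
                self.buffer.append(record)
                self.size = len(record)
                return finished
            self.buffer.append(record)
            self.size += len(record)
            # Cut when the file reaches 100 MB or turns five minutes old,
            # whichever happens first.
            if (self.size >= MIN_FLUSH_SIZE
                    or time.monotonic() - self.opened_at >= MAX_AGE_SECONDS):
                return self._cut()
            return None

        def _cut(self) -> list[bytes]:
            finished, self.buffer, self.size = self.buffer, [], 0
            self.opened_at = time.monotonic()
            return finished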

ReadyFlow details
Source: Kafka topic
Source Format: JSON, CSV, Avro
Destination: Amazon S3
Destination Format: Avro
Table 1. Kafka to S3 Avro ReadyFlow configuration parameters
CDP Workload User: Specify the CDP machine user or workload user name that you want to use to authenticate to Kafka and the object store. Ensure this user has the appropriate access rights in Ranger for the Kafka topic and in IDBroker for object store access.
CDP Workload User Password: Specify the password of the CDP machine user or workload user you are using to authenticate against Kafka and the object store.
CDPEnvironment: DataFlow uses this parameter to auto-populate the Flow Deployment with the Hadoop configuration files required to interact with S3. DataFlow automatically adds all configuration files required to interact with Data Lake services; any unnecessary files that get added do not impact the deployment process.

CSV Delimiter: If your source data is CSV, specify the delimiter here.
Data Input Format: Specify the format of your input data. Supported values are:
  • CSV
  • JSON
  • AVRO
Kafka Broker Endpoint: Specify the Kafka bootstrap servers string as a comma-separated list.
Kafka Consumer Group ID: Specify the name of the consumer group used for the source topic you are consuming from.
Kafka Source Topic: Specify the name of the topic you want to read from. (The three Kafka parameters are illustrated in the consumer sketch after this table.)
S3 Bucket: Specify the name of the S3 bucket you want to write to. The full path will be constructed from s3a://#{S3 Bucket}/#{S3 Path}/${Kafka.topic} (see the path example after this table).

S3 Bucket Region: Specify the AWS region in which your bucket was created. Supported values are:
  • us-gov-west-1
  • us-gov-east-1
  • us-east-1
  • us-east-2
  • us-west-1
  • us-west-2
  • eu-west-1
  • eu-west-2
  • eu-west-3
  • eu-central-1
  • eu-north-1
  • eu-south-1
  • ap-east-1
  • ap-south-1
  • ap-southeast-1
  • ap-southeast-2
  • ap-northeast-1
  • ap-northeast-2
  • ap-northeast-3
  • sa-east-1
  • cn-north-1
  • cn-northwest-1
  • ca-central-1
  • me-south-1
  • af-south-1
  • us-iso-east-1
  • us-isob-east-1
  • us-iso-west-1

S3 Path: Specify the path within the bucket where you want to write to, without any leading slash. The full path will be constructed from s3a://#{S3 Bucket}/#{S3 Path}/${Kafka.topic}.

Schema Name: Specify the schema name to be looked up in the Schema Registry for the source Kafka topic.
Schema Registry Hostname: Specify the hostname of the Schema Registry you want to connect to. This must be the direct hostname of the Schema Registry itself, not the Knox Endpoint.
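
To show what the Kafka parameters correspond to, here is an illustrative standalone consumer in Python. It assumes the confluent-kafka package, and the broker addresses, group ID, and topic name are placeholder values, not part of the ReadyFlow; the flow's own consumer is configured entirely through the parameters above.

    from confluent_kafka import Consumer

    # Placeholder values standing in for the three Kafka parameters above.
    consumer = Consumer({
        "bootstrap.servers": "broker-1:9092,broker-2:9092",  # Kafka Broker Endpoint
        "group.id": "kafka-to-s3-avro",                      # Kafka Consumer Group ID
        "auto.offset.reset": "earliest",
    })
    consumer.subscribe(["my-source-topic"])                  # Kafka Source Topic

    try:
        while True:
            msg = consumer.poll(timeout=1.0)
            if msg is None:
                continue
            if msg.error():
                raise RuntimeError(msg.error())
            # The ReadyFlow would parse this payload as JSON, CSV, or Avro
            # and merge it into the current Avro file.
            print(msg.value())
    finally:
        consumer.close()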
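
And to make the path expression concrete, a small sketch of how the full destination path is assembled from the S3 Bucket, S3 Path, and source topic parameters; the function name and example values are hypothetical, not part of the ReadyFlow.

    def destination_path(s3_bucket: str, s3_path: str, kafka_topic: str) -> str:
        # Mirrors s3a://#{S3 Bucket}/#{S3 Path}/${Kafka.topic}; stripping any
        # leading or trailing slash from S3 Path avoids empty path segments.
        return f"s3a://{s3_bucket}/{s3_path.strip('/')}/{kafka_topic}"

    # destination_path("my-bucket", "ingest/avro", "orders")
    #   -> "s3a://my-bucket/ingest/avro/orders"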