ReadyFlow: S3 to S3 Avro with S3 Notifications

Learn about the S3 to S3 Avro with S3 Notifications ReadyFlow.

This ReadyFlow consumes JSON, CSV, or Avro files from a source S3 location, converts them to Avro, and then writes them to a destination S3 location. You can specify the source format, the source and destination locations, and the schema to use for reading the source data. The ReadyFlow is notified about new files in the source bucket through S3 notifications that you configure in AWS.
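For reference, S3 event notifications can be wired to an SQS queue with the AWS SDK. The following is a minimal boto3 sketch, assuming a hypothetical bucket my-source-bucket, a queue in the same region, and a source prefix input/; substitute your own values. Note that S3 rejects this call unless the queue's access policy already allows the bucket to send messages (see the queue policy sketch after the parameter table).

```python
import boto3

# Placeholder values -- substitute your own bucket, queue, region, and prefix.
SOURCE_BUCKET = "my-source-bucket"                                 # Source S3 Bucket
QUEUE_ARN = "arn:aws:sqs:us-east-1:123456789012:readyflow-events"  # Notification SQS queue
SOURCE_PREFIX = "input/"                                           # Source S3 Path
REGION = "us-east-1"                                               # Source S3 Region

s3 = boto3.client("s3", region_name=REGION)

# Deliver s3:ObjectCreated:* events under the source prefix to the queue.
s3.put_bucket_notification_configuration(
    Bucket=SOURCE_BUCKET,
    NotificationConfiguration={
        "QueueConfigurations": [
            {
                "QueueArn": QUEUE_ARN,
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {
                        "FilterRules": [
                            {"Name": "prefix", "Value": SOURCE_PREFIX},
                        ]
                    }
                },
            }
        ]
    },
)
```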

ReadyFlow details

Source:             Amazon S3 bucket
Source Format:      JSON, CSV, Avro
Destination:        Amazon S3 bucket
Destination Format: Avro

Table 1. S3 to S3 Avro with S3 Notifications ReadyFlow configuration parameters

CDP Workload User
    Specify the CDP machine user or workload user name that you want to use to authenticate to the object stores (via IDBroker) and to the Schema Registry. Ensure this user has the appropriate access rights to the object store locations and to the Schema Registry.

CDP Workload User Password
    Specify the password of the CDP machine user or workload user you are using to authenticate to the object stores and to the Schema Registry.

CDPEnvironment
    DataFlow uses this parameter to auto-populate the flow deployment with the Hadoop configuration files required to interact with S3. DataFlow automatically adds all required configuration files to interact with Data Lake services; unnecessary files that are added do not impact the deployment process.

CSV Delimiter
    If your source data is CSV, specify the delimiter here.

Data Input Format
    Specify the format of your input data: JSON, CSV, or Avro.

Destination S3 Bucket
    Specify the name of the destination S3 bucket you want to write to. The full destination path is constructed as s3a://#{Destination S3 Bucket}/#{Destination S3 Path}/[subdirectory]/${filename}, where [subdirectory] is the original directory of the file if the #{Source S3 Path} prefix is not specified (see the key-construction sketch after this table).

Destination S3 Path
    Specify the path within the destination bucket where you want to write to. See Destination S3 Bucket for how the full destination path is constructed.

Notification SQS Queue URL
    Specify the URL of the SQS queue set up to deliver notifications about S3 object creation events. The SQS queue must be in the same region as the source S3 bucket, and the bucket must be configured to send notifications to the queue (see the queue policy sketch after this table).

Schema Name
    Specify the name of the schema to look up in the Schema Registry. The schema is used to parse the source files.

Schema Registry Hostname
    Specify the hostname of the Schema Registry you want to connect to. This must be the direct hostname of the Schema Registry itself, not the Knox endpoint.

Source S3 Bucket
    Specify the name of the source S3 bucket you want to read from.

Source S3 Path
    Specify the path within the source bucket where you want to read files from. Notifications whose source object key does not match the Source S3 Path (as a prefix) are filtered out and not loaded. The Source S3 Path is stripped from the source object key and replaced with the Destination S3 Path in the destination object key (see the key-construction sketch after this table).

Source S3 Region
    Specify the AWS region of your source S3 bucket and S3 event notification SQS queue. Supported values are: us-gov-west-1, us-gov-east-1, us-east-1, us-east-2, us-west-1, us-west-2, eu-west-1, eu-west-2, eu-west-3, eu-central-1, eu-north-1, eu-south-1, ap-east-1, ap-south-1, ap-southeast-1, ap-southeast-2, ap-northeast-1, ap-northeast-2, ap-northeast-3, sa-east-1, cn-north-1, cn-northwest-1, ca-central-1, me-south-1, af-south-1, us-iso-east-1, us-isob-east-1, us-iso-west-1
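The Notification SQS Queue URL parameter assumes the queue already accepts messages from the source bucket. A minimal boto3 sketch of such a queue access policy follows; the queue URL, queue ARN, and bucket ARN are placeholders, not values taken from this ReadyFlow.

```python
import json

import boto3

# Placeholder values -- substitute your own queue URL, queue ARN, and bucket ARN.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/readyflow-events"
QUEUE_ARN = "arn:aws:sqs:us-east-1:123456789012:readyflow-events"
SOURCE_BUCKET_ARN = "arn:aws:s3:::my-source-bucket"

sqs = boto3.client("sqs", region_name="us-east-1")

# Allow S3 (acting for the source bucket only) to send messages to the queue.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "s3.amazonaws.com"},
            "Action": "sqs:SendMessage",
            "Resource": QUEUE_ARN,
            "Condition": {"ArnLike": {"aws:SourceArn": SOURCE_BUCKET_ARN}},
        }
    ],
}

sqs.set_queue_attributes(
    QueueUrl=QUEUE_URL,
    Attributes={"Policy": json.dumps(policy)},
)
```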
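To make the path construction concrete, the sketch below mirrors the documented behavior of the Source S3 Path and Destination S3 Path parameters. It is an illustration only, not the ReadyFlow's actual implementation; the function name and example values are hypothetical.

```python
from typing import Optional


def destination_key(source_key: str,
                    source_s3_path: str,
                    destination_s3_path: str) -> Optional[str]:
    """Illustrates the documented key rewrite. Returns None when the
    notification is filtered out because the source object key does not
    match the Source S3 Path prefix."""
    if source_s3_path:
        prefix = source_s3_path.rstrip("/") + "/"
        if not source_key.startswith(prefix):
            return None  # filtered out, not loaded
        remainder = source_key[len(prefix):]  # Source S3 Path stripped off
    else:
        # Without a Source S3 Path, the file's original directory is kept
        # as the [subdirectory] part of the destination key.
        remainder = source_key
    return f"{destination_s3_path.rstrip('/')}/{remainder}"


# Example: with Source S3 Path "input" and Destination S3 Path "avro",
# the object "input/2023/sales.csv" is written to
# s3a://#{Destination S3 Bucket}/avro/2023/sales.csv
assert destination_key("input/2023/sales.csv", "input", "avro") == "avro/2023/sales.csv"
assert destination_key("other/2023/sales.csv", "input", "avro") is None
```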