Deploying the S3 to S3 Avro ReadyFlow

Learn how to use the Deployment wizard to deploy the S3 to S3 Avro ReadyFlow using the information you collected in the prerequisites checklist.

The CDF Catalog is where you manage the flow definition lifecycle, from initial import, to versioning, to deploying a flow definition.

  1. In DataFlow, from the left navigation pane, click Catalog.
    Flow definitions available for you to deploy are displayed, one definition per row.
  2. Launch the Deployment wizard.
    1. Click the row of the flow definition you want to deploy to display its details and versions.
    2. Click a row representing a flow definition version to display flow definition version details and the Deploy button.
    3. Click Deploy to launch the Deployment wizard.
  3. From the Deployment wizard, select the environment to which you want to deploy this version of your flow definition.
  4. From the Overview, give your flow deployment a unique name and pick the NiFi Runtime Version for it.
    • You can use this name to distinguish between different versions of a flow definition, flow definitions deployed to different environments, and so on.

    • You can pick the NiFi Runtime Version for your flow deployment. Cloudera recommends that you always use the latest available version, if possible.

  5. In Parameters, specify parameter values such as connection strings and usernames, and upload files such as truststores. The sketch after these steps shows an illustrative set of parameter values.
  6. Specify your Sizing & Scaling configurations.
    NiFi node sizing
    • You can adjust the size of your cluster from Extra Small to Large.
    Number of NiFi nodes
    • You can set whether you want to automatically scale your cluster according to flow deployment capacity requirements. When you enable autoscaling, the minimum number of NiFi nodes is used for the initial size and the cluster scales up or down depending on resource demands.
    • You can set the number of nodes from 1 to 32.
  7. From KPIs, you may choose to identify key performance indicators (KPIs), the metrics to track those KPIs, and when and how you want to receive alerts about KPI metric tracking.

    See Working with KPIs for complete information about the KPIs available to you and how to monitor them.

  8. Review a summary of the information provided and make any necessary edits by clicking Previous. When you are finished, complete your flow deployment by clicking Deploy.
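
The following sketch is a convenient, purely illustrative way to keep the wizard inputs from your prerequisites checklist in one place before you enter them in steps 4 and 5. Every value shown is a placeholder (hypothetical user, bucket, and hostname names), not a default, and the script is not part of the ReadyFlow.

  # Illustrative pre-flight notes for the S3 to S3 Avro ReadyFlow deployment.
  # All values are placeholders; replace them with the ones you collected
  # while meeting the prerequisites.
  deployment_name = "s3-to-s3-avro-dev"        # unique per environment
  nifi_runtime_version = "latest available"    # Cloudera recommends the latest version

  parameters = [
      ("CDP Workload User", "srv_cdf_ingest"),             # hypothetical machine user
      ("CDP Workload User Password", "********"),
      ("Data Input Format", "CSV"),                        # CSV, JSON, or AVRO
      ("CSV Delimiter", ","),                              # only used for CSV input
      ("Data Output Format", "AVRO"),
      ("S3 Bucket", "my-source-bucket"),                   # bucket to read from
      ("S3 Path", "input/events"),                         # path to read from, no leading slash
      ("S3 Bucket", "my-destination-bucket"),              # bucket to write to
      ("S3 Path", "output/events"),                        # path to write to, no leading slash
      ("S3 Bucket Region", "us-east-1"),
      ("Schema Name", "customer"),                         # hypothetical schema name
      ("Schema Registry Hostname", "sr-master0.example.cloudera.site"),
  ]

  for name, value in parameters:
      print(f"{name}: {value}")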

Once you click Deploy, you are redirected to the Alerts tab in the detail view for the deployment, where you can track its progress.

The following parameters are required for the S3 to S3 Avro data flow. You collected this information in the Meeting the prerequisites step.

Table 1. S3 to S3 Avro ReadyFlow configuration parameters

CDP Workload User
  Specify the CDP machine user or workload username that you want to use to authenticate against Schema Registry. Ensure this user has the appropriate access rights in Ranger.

CDP Workload User Password
  Specify the password of the CDP machine user or workload user you are using to authenticate against Schema Registry.

CDP Environment
  DataFlow uses this parameter to auto-populate the flow deployment with the Hadoop configuration files required to interact with S3. DataFlow automatically adds all configuration files required to interact with Data Lake services; any unnecessary files that are added do not impact the deployment process.

Data Input Format
  Specify the format of your input data. If your data input format is CSV, define a CSV delimiter for the data in the CSV Delimiter text box. If you use AVRO or JSON format, the delimiter is ignored.
  Example values:
  • CSV
  • JSON
  • AVRO

Data Output Format
  Specify the format of your output data. As you are using AVRO format, the delimiter is ignored.
  Example value: AVRO
CSV Delimiter
  If your source data is CSV, specify the delimiter here.

S3 Bucket
  Specify the name of the S3 bucket that you want to read from. The full path is constructed from:
  s3a://#{S3 Bucket}/#{S3 Path}

S3 Path
  Specify the path within the bucket that you want to read from, without any leading characters. The full path is constructed from:
  s3a://#{S3 Bucket}/#{S3 Path}

S3 Bucket
  Specify the name of the S3 bucket that you want to write to. The full path is constructed from:
  s3a://#{S3 Bucket}/#{S3 Path}

S3 Path
  Specify the path within the bucket that you want to write to, without any leading characters. The full path is constructed from:
  s3a://#{S3 Bucket}/#{S3 Path}

S3 Bucket Region
  Specify the AWS region in which your bucket was created. Supported values are:
  • us-gov-west-1
  • us-gov-east-1
  • us-east-1
  • us-east-2
  • us-west-1
  • us-west-2
  • eu-west-1
  • eu-west-2
  • eu-west-3
  • eu-central-1
  • eu-north-1
  • eu-south-1
  • ap-east-1
  • ap-south-1
  • ap-southeast-1
  • ap-southeast-2
  • ap-northeast-1
  • ap-northeast-2
  • ap-northeast-3
  • sa-east-1
  • cn-north-1
  • cn-northwest-1
  • ca-central-1
  • me-south-1
  • af-south-1
  • us-iso-east-1
  • us-isob-east-1
  • us-iso-west-1

Schema Name
  Identify the schema that you want to use in your data flow. DataFlow looks up this schema in the Schema Registry you define with the Schema Registry Hostname. See the Appendix for an example schema.

Schema Registry Hostname
  Specify the hostname of the Schema Registry running on the master node in the Streams Messaging cluster that you want to connect to. This must be the direct hostname of the Schema Registry itself, not the Knox Endpoint.
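
The full s3a:// path is assembled from the S3 Bucket and S3 Path parameters, which is why the path must be given without any leading characters. The following Python sketch, using hypothetical bucket and path values, illustrates that construction; it is not part of the ReadyFlow itself.

  def build_s3a_uri(bucket: str, path: str) -> str:
      # Strip leading slashes so the resulting URI does not contain a double slash;
      # this mirrors the requirement that S3 Path has no leading characters.
      return f"s3a://{bucket}/{path.lstrip('/')}"

  # Hypothetical values for illustration only.
  print(build_s3a_uri("my-source-bucket", "input/events"))        # s3a://my-source-bucket/input/events
  print(build_s3a_uri("my-destination-bucket", "/output/events")) # leading slash removed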
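
If you are unsure which value to enter for S3 Bucket Region, you can look up where a bucket was created with the AWS SDK. This optional check runs with your own AWS credentials, entirely outside DataFlow, and assumes the boto3 library and a hypothetical bucket name.

  import boto3

  def bucket_region(bucket_name: str) -> str:
      # get_bucket_location returns None as the constraint for buckets in us-east-1.
      s3 = boto3.client("s3")
      constraint = s3.get_bucket_location(Bucket=bucket_name)["LocationConstraint"]
      return constraint or "us-east-1"

  print(bucket_region("my-source-bucket"))  # hypothetical bucket name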
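
Before deploying, you can verify that the schema referenced by Schema Name is registered in the Schema Registry you reach via Schema Registry Hostname. The sketch below assumes the registry's REST API is reachable over plain HTTP on the default port 7788 and needs no additional authentication; adjust the scheme, port, and credentials for your cluster. The hostname and schema name shown are hypothetical.

  import requests

  def latest_schema_version(registry_host: str, schema_name: str) -> dict:
      # Query the Schema Registry REST API for the latest version of the schema.
      url = (
          f"http://{registry_host}:7788/api/v1/schemaregistry/"
          f"schemas/{schema_name}/versions/latest"
      )
      response = requests.get(url, timeout=10)
      response.raise_for_status()
      return response.json()

  print(latest_schema_version("sr-master0.example.cloudera.site", "customer"))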
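
To see how the CSV Delimiter, Schema Name, and AVRO output format relate, the sketch below converts two hypothetical delimited records into Avro using the fastavro library. The schema and data are made up for illustration; the real schema is the one registered under Schema Name (see the Appendix referenced above), and fastavro is not part of the ReadyFlow.

  import csv
  import io

  from fastavro import parse_schema, writer

  # Hypothetical schema standing in for the one registered in Schema Registry.
  schema = parse_schema({
      "type": "record",
      "name": "customer",
      "fields": [
          {"name": "id", "type": "string"},
          {"name": "name", "type": "string"},
      ],
  })

  # Two hypothetical input lines using "|" as the CSV Delimiter.
  csv_data = "1|Alice\n2|Bob\n"
  records = [
      {"id": row[0], "name": row[1]}
      for row in csv.reader(io.StringIO(csv_data), delimiter="|")
  ]

  # The flow delivers Avro to the destination bucket; here we just write to memory.
  buffer = io.BytesIO()
  writer(buffer, schema, records)
  print(f"Wrote {len(records)} records as {buffer.tell()} Avro bytes")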