Build the data flow

Learn how you can create an ingest data flow to move data to Google Cloud Storage (GCS) buckets. This involves opening Apache NiFi in your Flow Management cluster, adding processors and other data flow objects to your canvas, and connecting your data flow elements.

You can use either the PutHDFS or the PutGCSObject processor to build your Google Cloud ingest data flow. Regardless of which processor you choose, the first steps are the same: open NiFi, add your processors to the canvas, and connect them to create the flow. A scripted equivalent of these steps, using the NiFi REST API, is sketched after the procedure below.

  1. Open NiFi in Data Hub.
    1. To access the NiFi service in your Flow Management Data Hub cluster, navigate to Management Console service > Data Hub Clusters.
    2. Click the tile representing the Flow Management Data Hub cluster you want to work with.
    3. Click NiFi in the Services section of the cluster overview page to access the NiFi UI.
    You will be logged into NiFi automatically with your CDP credentials.
  2. Add the TailFile processor for data input in your data flow.
    1. Drag and drop the processor icon onto the canvas.
      This displays a dialog that allows you to choose the processor you want to add.
    2. Select the TailFile processor from the list.
    3. Click Add or double-click the required processor type to add it to the canvas.
  3. Add the MergeContent processor.

    The TailFile processor produces many small FlowFiles, so it is practical to merge them into larger files before writing them to GCS.

  4. Add a processor for writing data to GCS.
    You have two options:
    • PutHDFS processor: The HDFS client writes to GCS through the GCS API. This solution leverages centralized CDP security. You can use the usernames and passwords you set up in CDP for central authentication, and all requests go through IDBroker.
    • PutGCSObject processor: This is a GCS-specific processor that directly interacts with the Google Cloud Storage API. It is not integrated into CDP’s authentication and authorization frameworks. If you use it for data ingestion, the authorization process is handled in GCP, and you need to provide the Google Cloud access credentials file / information when configuring the processor. You can do this by using the GCPCredentialsControllerService controller service in your data flow.
    1. Drag and drop the processor icon onto the canvas.
      In the dialog box you can choose which processor you want to add.
    2. Select the processor of your choice from the list.
    3. Click Add or double-click the required processor type to add it to the canvas.
  5. Connect the processors to create the data flow by clicking the connection icon of the first processor and dragging and dropping it on the second processor.

    A Create Connection dialog appears with two tabs: Details and Settings. You can configure the connection's name, FlowFile expiration time period, thresholds for back pressure, load balance strategy, and prioritization.

    1. Connect TailFile with MergeContent.
    2. Select the success relationship of the TailFile processor for this connection.
    3. Click Add to close the dialog box and add the connection to your data flow.
    4. Connect MergeContent with your target data processor (PutHDFS / PutGCSObject).
    5. Select the merged relationship of the MergeContent processor for this connection.
    6. Click Add to close the dialog box and add the connection to your data flow.
  6. Optionally, you can add funnels to your flow.
    If you want to know more about working with funnels, see the Apache NiFi User Guide.
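
A scripted equivalent of the procedure above can be useful for automation or review. The following is a minimal sketch, assuming Python with the requests library and the standard NiFi REST API; the cluster URL, process group ID, and authentication setup are placeholders for your own environment, and the PutGCSObject type noted in the comments can be substituted for PutHDFS.

```python
import requests

# Placeholders: replace with your Flow Management cluster's NiFi URL and the
# ID of the process group you are building the flow in.
NIFI_API = "https://<flow-management-host>:8443/nifi-api"
ROOT_PG = "<process-group-id>"

session = requests.Session()
# Authentication is environment specific (for example, a bearer token obtained
# from the /access/token endpoint) and is omitted from this sketch.

def create_processor(proc_type, name, x, y):
    """Add a processor to the process group and return its entity."""
    body = {
        "revision": {"version": 0},
        "component": {
            "type": proc_type,
            "name": name,
            "position": {"x": x, "y": y},
        },
    }
    resp = session.post(f"{NIFI_API}/process-groups/{ROOT_PG}/processors", json=body)
    resp.raise_for_status()
    return resp.json()

def connect(source, destination, relationships):
    """Connect two processors for the given relationships."""
    body = {
        "revision": {"version": 0},
        "component": {
            "source": {"id": source["id"], "groupId": ROOT_PG, "type": "PROCESSOR"},
            "destination": {"id": destination["id"], "groupId": ROOT_PG, "type": "PROCESSOR"},
            "selectedRelationships": relationships,
        },
    }
    resp = session.post(f"{NIFI_API}/process-groups/{ROOT_PG}/connections", json=body)
    resp.raise_for_status()
    return resp.json()

# Steps 2-4: add the processors to the canvas. For the PutGCSObject option,
# use "org.apache.nifi.processors.gcp.storage.PutGCSObject" instead of PutHDFS.
tail_file = create_processor("org.apache.nifi.processors.standard.TailFile", "TailFile", 0, 0)
merge_content = create_processor("org.apache.nifi.processors.standard.MergeContent", "MergeContent", 0, 200)
put_hdfs = create_processor("org.apache.nifi.processors.hadoop.PutHDFS", "PutHDFS", 0, 400)

# Step 5: connect TailFile to MergeContent on the success relationship, and
# MergeContent to the target processor on the merged relationship.
connect(tail_file, merge_content, ["success"])
connect(merge_content, put_hdfs, ["merged"])
```
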
This example data flow has been created using the PutHDFS processor.

If you are using the PutHDFS processor, configure IDBroker mapping authorization.
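
If you scripted the flow as in the sketch above, the PutHDFS processor's properties can be updated over the same REST API. This snippet only illustrates the mechanism; the Hadoop configuration file path and the gs:// target directory are placeholders, and the actual values to use are covered in the processor configuration part of this use case.

```python
def update_processor_properties(entity, properties):
    """Update an existing processor's configuration properties."""
    body = {
        "revision": entity["revision"],
        "component": {"id": entity["id"], "config": {"properties": properties}},
    }
    resp = session.put(f"{NIFI_API}/processors/{entity['id']}", json=body)
    resp.raise_for_status()
    return resp.json()

# Placeholder values: point PutHDFS at your core-site.xml and your GCS bucket.
put_hdfs = update_processor_properties(put_hdfs, {
    "Hadoop Configuration Resources": "/path/to/core-site.xml",
    "Directory": "gs://<your-bucket>/<target-directory>",
})
```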

If you are using the PutGCSObject processor, you do not need IDBroker mapping, so proceed to configuring controller services for the processors in your data flow.
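
For the PutGCSObject option, a sketch of wiring in the credentials is shown below, reusing the helper functions from the earlier snippets. The controller service type string and the property key are assumptions based on the NiFi GCP bundle and should be verified against your NiFi version; the service still needs to be configured with your Google Cloud credentials information and enabled before the flow can run.

```python
def create_controller_service(service_type, name):
    """Add a controller service to the process group and return its entity."""
    body = {
        "revision": {"version": 0},
        "component": {"type": service_type, "name": name},
    }
    resp = session.post(
        f"{NIFI_API}/process-groups/{ROOT_PG}/controller-services", json=body
    )
    resp.raise_for_status()
    return resp.json()

# Assumed fully qualified type of the GCP credentials controller service.
gcp_credentials = create_controller_service(
    "org.apache.nifi.processors.gcp.credentials.service.GCPCredentialsControllerService",
    "GCPCredentialsControllerService",
)

# Create PutGCSObject instead of PutHDFS and reference the credentials service.
# The property key is an assumption; the remaining properties (bucket, object
# key, and the credentials file itself) are set when you configure the
# controller service and processor later in this use case.
put_gcs_object = create_processor(
    "org.apache.nifi.processors.gcp.storage.PutGCSObject", "PutGCSObject", 0, 400
)
put_gcs_object = update_processor_properties(put_gcs_object, {
    "GCP Credentials Provider Service": gcp_credentials["id"],
})
```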