Defining your Cloudera Base on premises data flow

To move data between cloud environments using NiFi site-to-site communication, you need a data flow in your Cloudera Base on premises cluster that can send data to and receive data from the Cloudera on cloud cluster. To create this data flow, connect a processor to a Remote Process Group that is configured to use the HTTP protocol, and then enable transmission.

  • You have defined your Cloudera on cloud data flow and configured Ranger policies for site-to-site communication.
  • You have the public FQDNs for your Cloudera on cloud cluster nodes.
  1. In your Cloudera Base on premises cluster, launch the NiFi UI and drag a GenerateFlowFile processor onto the canvas.
    For this use case, GenerateFlowFile creates 1 MB files every 10 seconds.
  2. Drag a Remote Process Group onto the NiFi canvas, set the transport protocol to HTTP, and specify one or more of the NiFi nodes running on your Cloudera on cloud cluster.
    After the site-to-site connection is initiated, the source NiFi cluster is aware of the topology of the remote NiFi cluster and of any increase or decrease in the size of the remote cluster. However, it is recommended that you specify at least two nodes to ensure higher availability when the site-to-site connection is initiated.
  3. Right-click the Remote Process Group and select Enable transmission.
  4. Connect the GenerateFlowFile processor to the Remote Process Group and select the Input Port that you created and started on the remote cluster in Cloudera on cloud.
  5. You can also define a connection from the Remote Process Group to another component to download data made available by the remote cluster running in the Cloudera on cloud environment. In this example, the Remote Process Group is connected to a funnel.
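
The first two steps above can also be scripted rather than performed in the UI. The following is a minimal sketch, not a Cloudera-provided procedure: it only builds the JSON request bodies that the NiFi REST API accepts when creating a processor (POST to /nifi-api/process-groups/{id}/processors) and a Remote Process Group (POST to /nifi-api/process-groups/{id}/remote-process-groups). The function names, canvas positions, and node URLs are hypothetical placeholders; substitute the public FQDNs of your own Cloudera on cloud NiFi nodes, and verify the field names against the REST API documentation for your NiFi version.

```python
import json


def generate_flow_file_payload(file_size="1 MB", run_schedule="10 sec"):
    """Build a request body for a GenerateFlowFile processor.

    The defaults mirror the use case in this procedure: 1 MB files
    every 10 seconds. The canvas position is an arbitrary placeholder.
    """
    return {
        "revision": {"version": 0},  # new components start at revision 0
        "component": {
            "type": "org.apache.nifi.processors.standard.GenerateFlowFile",
            "position": {"x": 0.0, "y": 100.0},
            "config": {
                "schedulingPeriod": run_schedule,
                "properties": {"File Size": file_size},
            },
        },
    }


def remote_process_group_payload(target_nodes):
    """Build a request body for a Remote Process Group that uses HTTP.

    target_nodes is a list of remote NiFi URLs. Listing at least two
    nodes follows the availability recommendation in step 2.
    """
    if not target_nodes:
        raise ValueError("specify at least one remote NiFi node")
    return {
        "revision": {"version": 0},
        "component": {
            # Multiple remote nodes are passed as a comma-separated string.
            "targetUris": ",".join(target_nodes),
            "transportProtocol": "HTTP",
            "position": {"x": 400.0, "y": 100.0},
        },
    }


if __name__ == "__main__":
    # Hypothetical node URLs, for illustration only.
    nodes = [
        "https://nifi-node-1.example.com:8443/nifi",
        "https://nifi-node-2.example.com:8443/nifi",
    ]
    print(json.dumps(generate_flow_file_payload(), indent=2))
    print(json.dumps(remote_process_group_payload(nodes), indent=2))
```

The sketch deliberately stops short of sending the requests, because authentication and the parent process group ID depend on your cluster. Enabling transmission and selecting the remote Input Port (steps 3 and 4) still have to be done afterwards, either in the UI or through further API calls.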

After you have defined the data flow for your Cloudera Base on premises cluster, start the Cloudera Base on premises data flow and confirm that the data is moving back and forth between the environments:

In the Cloudera Base on premises environment, your data flow looks similar to the following:

In the Cloudera on cloud environment, your data flow looks similar to the following: