Building a dataflow in Cloudera Edge Management

You can build an automated dataflow using the Edge Flow Manager UI in Cloudera Edge Management. Simply drag components from the toolbar to the canvas, configure the components to meet specific needs, and connect the components together.

For additional information about flow creation and related concepts, see the video in the Cloudera Edge Management YouTube playlist.

Adding components to the canvas in Cloudera Edge Management

Learn how to add each of the components available in the Components Toolbar in the Edge Flow Manager UI. You can add processors, remote process groups, and funnels.

Processor

The processor is the most commonly used component, as it is responsible for data ingress, egress, routing, and manipulation. There are many different types of processors. When you drag a processor onto the canvas, the Add Processor dialog appears, as shown in the following image, which allows you to choose which type of processor to use:


You can filter the list by processor type using the Filter items field at the top-right corner of the Add Processor dialog. After selecting a processor, click the Add button to add it to the canvas at the location where it was dropped. Alternatively, you can double-click on a processor type.
After you drag a processor onto the canvas, you can configure properties of the processor, parameterize processor property values, or delete the processor. To configure properties, double-click on the processor, or right-click on the processor and select Configure from the context menu. To delete a processor, right-click on the processor and select Delete from the context menu, or highlight the processor and press the Delete key on your keyboard. The following image shows the Configure and Delete options in the context menu:


Remote process group

A Remote Process Group (RPG) references a remote instance of NiFi. When you drag an RPG onto the canvas, rather than being prompted for a name, you are prompted for the URL of the remote NiFi instance. If the remote NiFi is clustered, you need to provide the URL of at least one NiFi instance in that cluster. When data is transferred from an RPG running in MiNiFi, the RPG first connects to the remote instance whose URL is configured to determine which nodes are in the cluster and how busy each node is. This information is then used to load balance the data that is pushed to each node. The remote instances are then interrogated periodically to determine whether any nodes have been dropped from or added to the cluster and to recalculate the load balancing based on the load of each node. If the cluster node specified in the URL is down, the RPG cannot establish a connection with the cluster. To mitigate this scenario, you can enter multiple URLs, allowing the RPG to establish a connection with more than one node.
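
The load-balancing idea described above can be illustrated with a small Python sketch. This is not the algorithm MiNiFi actually uses. It assumes, purely for illustration, that each node reports a count of queued FlowFiles and that less busy nodes receive a proportionally larger share of the data. The names `NodeStatus`, `distribution_weights`, and `first_reachable`, as well as the example URLs, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class NodeStatus:
    """Hypothetical status record for one node of the remote cluster."""
    url: str
    queued_flowfiles: int  # assumed measure of how busy the node is


def distribution_weights(nodes):
    """Return the share of outgoing data per node URL; less busy nodes get more.

    Illustrative weighting only: weight = 1 / (1 + queued_flowfiles),
    normalized so that all shares sum to 1.
    """
    if not nodes:
        raise ValueError("no reachable nodes in the remote cluster")
    raw = {n.url: 1.0 / (1 + n.queued_flowfiles) for n in nodes}
    total = sum(raw.values())
    return {url: weight / total for url, weight in raw.items()}


def first_reachable(urls, is_reachable):
    """Try each configured URL in order and return the first that responds.

    Mirrors the reason for entering multiple URLs: if one node is down,
    another configured node can still be contacted.
    """
    for url in urls:
        if is_reachable(url):
            return url
    raise ConnectionError("none of the configured URLs is reachable")


if __name__ == "__main__":
    cluster = [
        NodeStatus("https://nifi-1.example.com:8443/nifi", 0),
        NodeStatus("https://nifi-2.example.com:8443/nifi", 3),
    ]
    for url, share in distribution_weights(cluster).items():
        print(f"{url}: {share:.0%}")
```

Recomputing the weights each time the node list is refreshed corresponds to the periodic recalculation described in the paragraph above.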

After you drag an RPG onto the canvas, you can configure settings of the RPG or delete the RPG. To configure settings, double-click on the RPG, or right-click on the RPG and select Configure from the context menu. To delete an RPG, right-click on the RPG and select Delete from the context menu, or highlight the RPG and press the Delete key on your keyboard. The following image shows the Configure and Delete options in the context menu:


Funnel

Funnels are used to combine data from many connections into a single connection. If many connections are created with the same destination, the canvas can become cluttered if those connections have to span a large space. By funneling these connections into a single connection, that single connection can then be drawn to span that large space instead.

To delete a funnel, right-click on the funnel and select Delete from the context menu, or highlight the funnel and press the Delete key on your keyboard.

Configuring a processor in Cloudera Edge Management

Learn how to configure a processor using the Edge Flow Manager UI in Cloudera Edge Management.

  1. To configure a processor, right-click on the processor and select the Configure option.
    Alternatively, just double-click on the processor.
    The Configuration dialog opens as shown in the following image:


    The Configuration dialog contains the following sections:

    • Settings. The Settings section contains the following configuration items:
      Property: Description
      Processor Name: Allows you to change the name of the processor. The name of a processor by default is the same as the processor type.
      Penalty Duration: The amount of time used when a processor penalizes a FlowFile. During the normal course of processing a piece of data (a FlowFile), an event might occur that indicates that the data cannot be processed at this time but the data might be processable at a later time. When this occurs, the processor might choose to penalize the FlowFile. This prevents the FlowFile from being processed for some period of time. For example, if the processor needs to push the data to a remote service, but the remote service already has a file with the same name as the filename that the processor is specifying, the processor might penalize the FlowFile. The penalty duration allows you to specify how long the FlowFile must be penalized. The default value is 30,000 milliseconds.
      Yield Duration: When a processor yields, the amount of time that elapses before the processor is re-scheduled is the yield duration. A processor might determine that some situation exists such that the processor can no longer make any progress, regardless of the data that it is processing. For example, if a processor needs to push data to a remote service and that service is not responding, the processor cannot make any progress. As a result, the processor must yield, which prevents the processor from being scheduled to run for some period of time. The default value is 1,000 milliseconds.
      Automatically Terminated Relationships: Each of the relationships that is defined by the processor is listed here. In order for a processor to be considered valid, each relationship defined by the processor must be either connected to a downstream component or auto-terminated. If a relationship is auto-terminated, any FlowFile that is routed to that relationship is removed from the flow and its processing is considered as complete.
    • Scheduling. The Scheduling section contains the following configuration items:
      Property: Description
      Scheduling Strategy: There are two options for scheduling components:
      • Timer Driven. This is the default mode. The processor is scheduled to run at a regular interval. The interval at which the processor runs is defined by the Run Schedule option (see below).
      • Event Driven. When this mode is selected, the processor is triggered to run by an event, and that event occurs when FlowFiles enter connections feeding this processor. This mode is currently considered experimental and is not supported by all processors. When this mode is selected, the Run Schedule option is not configurable, as the processor is not triggered to run periodically but as the result of an event.
      Concurrent Tasks: This controls how many threads the processor uses, that is, how many FlowFiles can be processed by this processor at the same time. Increasing this value allows the processor to handle more data in the same amount of time. However, it does this by using system resources that are then not usable by other processors. This essentially provides a relative weighting of processors: it controls how much of the system's resources are allocated to this processor instead of other processors. This field is available for most processors. There are, however, some types of processors that can only be scheduled with a single concurrent task.
      Run Schedule: This dictates how often the processor is scheduled to run. The valid values for this field depend on the selected scheduling strategy (see above). When you select the Event Driven scheduling strategy, this field is not available. When you select the Timer Driven scheduling strategy, this value is a time duration specified by a number followed by a time unit, for example, 1 second or 5 mins. A value of 0 seconds means that the processor runs as often as possible as long as it has data to process. This is true for any time duration of 0, regardless of the time unit (for example, 0 sec, 0 mins, 0 days).
      Run Duration: This slider controls how long the processor is scheduled to run each time it is triggered. When a processor finishes running, it must update the repository in order to transfer the FlowFiles to the next connection. Updating the repository is expensive, so the more work that can be done at once before updating the repository, the more work the processor can handle (higher throughput). However, this means that the next processor cannot start processing those FlowFiles until the previous processor updates the repository, so the latency (the time required to process the FlowFile from beginning to end) becomes longer. The slider therefore provides a spectrum from which you can choose to favor Lower Latency or Higher Throughput.
    • Properties. The Properties section provides a mechanism to configure processor-specific behavior. There are no default properties. Each type of processor must define which properties make sense for its use case.
      A GenerateFlowFile processor, by default, has four properties: Batch Size, Data Format, File Size, and Unique FlowFiles. Next to the name of each property, a small question-mark symbol indicates that additional information is available. Hovering over this symbol with the mouse shows additional details about the property, its default value, and whether Expression Language is supported. Here is an example of a GenerateFlowFile processor with additional information for the Batch Size property:


      Clicking on the value of a property allows you to change the value. Depending on the values that are allowed for the property, you are provided either a drop-down from which to choose a value or a text area in which to type a value. Here is an example of a GenerateFlowFile processor with the drop-down for the Data Format property:


      Each of the properties has an arrow in the row showing that they can be converted to parameters. The following image shows the Convert to parameter option for the Unique FlowFiles property:


      For some processors, an Add Property button appears beside the Properties section for adding a user-defined property. When you click this button, a dialog opens, which allows you to enter the name and value of a new property. Not all processors allow user-defined properties. The RouteOnAttribute processor, however, allows user-defined properties. In fact, this processor is not valid until you add a property. The following image shows the Add Property button in the Configuration dialog of the RouteOnAttribute processor:


    • About. The About section provides the Processor ID, Processor Type, and Bundle details of the processor, as shown in the following image:


    • Comments. The Comments section provides an area for you to include whatever comments are appropriate for this component.
  2. After you configure a processor, click the Apply button to apply the changes.
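
To make the Run Schedule and Penalty Duration settings above more concrete, the following Python sketch converts a duration such as "5 mins" into seconds, checks for the "run as often as possible" case, and computes when a penalized FlowFile becomes eligible for processing again. It is only an illustration under stated assumptions: the unit table is an assumed subset of the spellings that the product may accept, and the function names are hypothetical rather than part of any Cloudera Edge Management or MiNiFi API.

```python
import re

# Assumed subset of time units, for illustration only; the product may
# accept more (or different) spellings.
_UNIT_SECONDS = {
    "ms": 0.001, "millis": 0.001,
    "s": 1, "sec": 1, "secs": 1, "second": 1, "seconds": 1,
    "m": 60, "min": 60, "mins": 60, "minute": 60, "minutes": 60,
    "h": 3600, "hr": 3600, "hrs": 3600, "hour": 3600, "hours": 3600,
    "d": 86400, "day": 86400, "days": 86400,
}


def parse_duration(text):
    """Convert a string such as '5 mins' into a number of seconds."""
    match = re.fullmatch(r"\s*(\d+)\s*([A-Za-z]+)\s*", text)
    if not match:
        raise ValueError(f"not a duration: {text!r}")
    value, unit = int(match.group(1)), match.group(2).lower()
    if unit not in _UNIT_SECONDS:
        raise ValueError(f"unknown time unit: {unit!r}")
    return value * _UNIT_SECONDS[unit]


def runs_continuously(run_schedule):
    """A duration of 0, in any unit, means 'run as often as possible'."""
    return parse_duration(run_schedule) == 0


def penalized_until(penalized_at_ms, penalty_duration_ms=30_000):
    """Earliest time (in ms) at which a penalized FlowFile is processed again.

    The default of 30,000 ms matches the Penalty Duration default above.
    """
    return penalized_at_ms + penalty_duration_ms


if __name__ == "__main__":
    print(parse_duration("5 mins"))
    print(runs_continuously("0 days"))
    print(penalized_until(1_000))
```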

Configuring a remote process group in Cloudera Edge Management

Learn how to configure a remote process group using the Edge Flow Manager UI in Cloudera Edge Management.

  1. To configure an RPG, right-click on the RPG and select the Configure option.
    Alternatively, just double-click on the RPG. The Configuration dialog opens as shown in the following image:


    The Configuration dialog contains the following two sections:
    • Settings
    • About. The About section provides the Remote Process Group ID.
  2. Configure the following properties in the Settings section:
    Property: Description
    URL: Allows you to change the URL of the RPG.
    Transport Protocol: There are two options for transport protocol:
    • RAW. This is the default protocol, which uses raw socket communication over a dedicated port.
    • HTTP. The HTTP transport protocol is useful if the remote NiFi instance is in a restricted network that only allows access through the HTTP(S) protocol or is only accessible from a specific HTTP proxy server.
    Local Network Interface: In some cases, it might be desirable to prefer one network interface over another. For example, if a wired interface and a wireless interface exist, the wired interface might be preferred. This can be configured by specifying the name of the network interface to use in this box. If the value entered is not valid, the Remote Process Group is not valid and does not communicate with other NiFi instances until this is resolved.
    HTTP Proxy Server Hostname: Specify the host name of the proxy server if you select the HTTP transport protocol.
    HTTP Proxy Server Port: Specify the port number of the proxy server if you select the HTTP transport protocol.
    Communications Timeout: When communication with the RPG takes longer than this amount of time, it times out. The default value is 30 seconds.
    Yield Duration: When communication with the RPG fails, it is not scheduled again until this amount of time elapses. The default value is 10 seconds.
  3. After you configure an RPG, apply the changes by clicking the Apply button.
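
The RPG settings above can be summarized as a small data model with their documented defaults. The Python sketch below is a hypothetical illustration, not a product API: the class name, the field names, and the example host names are invented, and the rule that proxy settings are meaningful only with the HTTP transport protocol is an assumption drawn from the property descriptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RpgSettings:
    """Hypothetical model of the RPG Settings section; not a product API."""
    url: str
    transport_protocol: str = "RAW"            # RAW is the documented default
    local_network_interface: Optional[str] = None
    http_proxy_host: Optional[str] = None
    http_proxy_port: Optional[int] = None
    communications_timeout: str = "30 seconds"  # documented default
    yield_duration: str = "10 seconds"          # documented default

    def problems(self):
        """Return a list of human-readable configuration problems."""
        issues = []
        if not self.url.strip():
            issues.append("URL is required")
        if self.transport_protocol not in ("RAW", "HTTP"):
            issues.append("transport protocol must be RAW or HTTP")
        proxy_set = bool(self.http_proxy_host) or self.http_proxy_port is not None
        # Assumption: proxy settings are only relevant for the HTTP protocol.
        if proxy_set and self.transport_protocol != "HTTP":
            issues.append("proxy settings apply only to the HTTP transport protocol")
        return issues


if __name__ == "__main__":
    settings = RpgSettings(
        url="https://nifi.example.com:8443/nifi",
        http_proxy_host="proxy.example.com",
        http_proxy_port=3128,
    )
    print(settings.problems())
```

Collecting problems in a list, rather than raising on the first one, mirrors how a configuration dialog typically reports every invalid field at once.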

Adding services in Cloudera Edge Management

Services are shared components that processors and other services can use for configuration or task execution. Learn how to add services using the Edge Flow Manager UI in Cloudera Edge Management.

  1. To add a service, click the SERVICES button at the bottom-left corner of the canvas, or simply right-click on the canvas and select Services.
    The Services window opens as shown in the following image:


  2. Click the ADD SERVICE button.
    The Add Service dialog opens. It provides a list of the available services as shown in the following image:


  3. Select the service you want to add and click Add, or simply double-click on the name of the service to add it.
    You can also use the Filter items field at the top-right corner of the window to search for the desired service by name.
  4. After you add a service, configure it by clicking the Configure icon in the far-right column.
    The Configuration dialog opens as shown in the following image:


    The Configuration dialog contains the following sections:
    • Settings. The Settings section provides a place for you to give the service a unique name. The name of a service by default is the same as the service type.
    • Properties. The Properties section lists the various properties that apply to the particular service. You can hover over the question mark icons with the mouse to see more information about each property.
    • About. The About section provides the Service ID, Service Type, and Bundle details of the service.
    • Comments. The Comments section is just an open-text field, where you can include comments about the service.
  5. After you configure a service, click the Apply button to apply the configuration.

    If you want to delete a service, click the trash icon in the far-right column. To return to the canvas, click the BACK TO FLOW link.

Connecting components in Cloudera Edge Management

After you add processors and other components to the canvas of the Edge Flow Manager UI and configure them, the next step is to connect them to one another. You do this by creating connections between the components.

  1. Hover the mouse over a component.
    An arrow appears as shown in the following image:


  2. Drag the arrow from one component to another until the second component is highlighted, then release the mouse.
    A Create Connection dialog appears as shown in the following image:


    The dialog allows you to choose the Source Relationships to include in the connection. At least one relationship must be selected. If only one relationship is available, it is automatically selected.
  3. Select Add to create the connection.
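
A processor is considered valid only when each relationship it defines is either connected to a downstream component or auto-terminated. The following Python sketch shows that rule as a simple set computation. The function names and the relationship names `success` and `failure` are hypothetical examples, not taken from a specific processor.

```python
def unhandled_relationships(defined, connected, auto_terminated):
    """Return the relationships that are neither connected nor auto-terminated."""
    handled = set(connected) | set(auto_terminated)
    return sorted(set(defined) - handled)


def is_valid(defined, connected, auto_terminated):
    """A processor is valid when no relationship is left unhandled."""
    return not unhandled_relationships(defined, connected, auto_terminated)


if __name__ == "__main__":
    defined = ["success", "failure"]
    # Only 'success' is connected, so 'failure' still needs attention.
    print(unhandled_relationships(defined, ["success"], []))
    # Auto-terminating 'failure' makes the processor valid.
    print(is_valid(defined, ["success"], ["failure"]))
```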

Configuring a connection in Cloudera Edge Management

After you create a connection, you can change the configuration properties of the connection or move the connection using the Edge Flow Manager UI in Cloudera Edge Management.

  1. To change the configuration of a connection, right-click on the connection and select the Configure option, or double-click on the connection.
    The Configuration dialog opens as shown in the following image:


    The Configuration dialog contains the following two sections:
    • Settings
    • About. The About section provides the Connection ID.
  2. Configure the following properties in the Settings section:
    Property: Description
    Source Relationship: Allows you to change the Source Relationships of the connection.
    FlowFile Expiration: FlowFile expiration is a concept by which data that cannot be processed in a timely fashion can be automatically removed from the flow. This is useful, for example, when the volume of data is expected to exceed the volume that can be sent to a remote site. The expiration period is based on the time that the data entered the MiNiFi instance. In other words, if the FlowFile expiration on a given connection is set to 1 hour, and a FlowFile that has been in the MiNiFi instance for one hour reaches that connection, it expires. The default value is 60 seconds. A value of 0 seconds indicates that the data never expires.
    Back Pressure Object Threshold: This is the number of FlowFiles that can be in the queue before back pressure is applied. The default value is 0.
    Back Pressure Size Threshold: This specifies the maximum amount of data (in size) that can be queued before back pressure is applied. The default value is 10,000 bytes.
    Connection Name: This field allows you to change the name of the connection. It is blank by default.
  3. After you configure a connection, click the Apply button to apply the changes.
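
The expiration and back pressure settings above can be sketched as two small checks in Python. This is an illustration under stated assumptions, not the MiNiFi implementation: it assumes that back pressure applies once a queue reaches a threshold and that a threshold of 0 disables that particular check. The source states only that an expiration of 0 seconds means the data never expires. The function and parameter names are hypothetical.

```python
def is_expired(age_seconds, expiration_seconds):
    """Decide whether a FlowFile has expired on a connection.

    age_seconds is measured from the time the data entered the MiNiFi
    instance. An expiration of 0 means the data never expires.
    """
    if expiration_seconds == 0:
        return False
    return age_seconds >= expiration_seconds


def back_pressure_applied(queued_count, queued_bytes,
                          object_threshold=0, size_threshold_bytes=10_000):
    """Decide whether the connection applies back pressure to its source.

    Assumption: a threshold of 0 disables that check, and back pressure
    starts when the queue reaches (not just exceeds) a threshold.
    """
    over_count = object_threshold > 0 and queued_count >= object_threshold
    over_size = size_threshold_bytes > 0 and queued_bytes >= size_threshold_bytes
    return over_count or over_size


if __name__ == "__main__":
    print(is_expired(age_seconds=3600, expiration_seconds=3600))
    print(back_pressure_applied(queued_count=5, queued_bytes=20_000))
```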

Bending connections in Cloudera Edge Management

Learn how to bend a connection using the Edge Flow Manager UI in Cloudera Edge Management.

  1. To add a bend point (or elbow) to an existing connection, simply double-click on the connection in the spot where you want the bend point to be.
  2. Use the mouse to grab the bend point and drag it so that the connection is bent in the desired way.
    The following image shows a bend point in the connection between GenerateFlowFile and LogAttribute processors:


    You can add as many bend points as you want. You can also use the mouse to drag and move the label on the connection to any existing bend point. To remove a bend point, simply double-click on it again.