Build your draft flow

Begin creating your draft flow by adding components to the canvas and setting them up.

  1. Add an InvokeHTTP processor to the canvas.

    Once configured, this processor will call the Wikipedia API to fetch the latest changes.

    1. Drag a Processor from the Components sidebar to the canvas.


    2. In the Search field, filter for InvokeHTTP.


    3. Rename the InvokeHTTP processor by changing the Processor Name to Get Recent Wikipedia Changes.
    4. Click the Add button.
  2. Configure the Get Recent Wikipedia Changes processor.
    Properties
    HTTP URL
    Enter the following URL:
    https://en.wikipedia.org/w/api.php?action=query&list=recentchanges&format=json&rcprop=user%7Ccomment%7Cparsedcomment%7Ctimestamp%7Ctitle%7Csizes%7Ctags


    Relationships

    Select the following relationships:

    • Original – Terminate
    • Failure – Terminate, Retry

    • Retry – Terminate

    • No Retry – Terminate

  3. Click the Apply button.
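    The HTTP URL set on the Get Recent Wikipedia Changes processor packs several query parameters into one line. As a rough illustration only, the following Python sketch decodes that URL so you can see which parameters are sent; the helper name describe_request is hypothetical and the sketch makes no network call.

```python
from urllib.parse import urlsplit, parse_qs

# The URL from the HTTP URL property, split across lines for readability.
URL = (
    "https://en.wikipedia.org/w/api.php"
    "?action=query&list=recentchanges&format=json"
    "&rcprop=user%7Ccomment%7Cparsedcomment%7Ctimestamp%7Ctitle%7Csizes%7Ctags"
)


def describe_request(url: str) -> dict:
    """Split a URL into its endpoint and its decoded query parameters."""
    parts = urlsplit(url)
    # parse_qs percent-decodes values, so %7C becomes the "|" separator.
    params = {key: values[0] for key, values in parse_qs(parts.query).items()}
    params["rcprop"] = params["rcprop"].split("|")
    endpoint = f"{parts.scheme}://{parts.netloc}{parts.path}"
    return {"endpoint": endpoint, "params": params}


request = describe_request(URL)
print(request["endpoint"])
for name, value in request["params"].items():
    print(f"{name} = {value}")
```

    The rcprop list is what determines which properties each change event carries into the rest of the flow.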
  4. Add a ConvertRecord processor to the canvas.
    1. Drag a Processor from the Components sidebar to the canvas.
    2. In the Search field, filter for ConvertRecord.
    3. Change the Processor Name to Convert JSON to AVRO.
    4. Click the Add button.

    This processor converts the JSON response to AVRO format by using RecordReaders and RecordWriters. It infers the JSON schema starting from the recent changes field.
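    To give a feel for what inferring a schema "starting from the recent changes field" means, here is a much simplified Python sketch. It is not how the record reader is implemented; the sample response is made up, and the path query.recentchanges, the type mapping, and the record name recent_change are assumptions chosen for illustration.

```python
import json

# Made-up sample shaped like a recentchanges response; the values are illustrative.
SAMPLE_RESPONSE = json.loads("""
{
  "batchcomplete": "",
  "query": {
    "recentchanges": [
      {"type": "edit", "title": "Example page", "user": "ExampleUser",
       "oldlen": 1200, "newlen": 1350, "timestamp": "2024-01-01T00:00:00Z",
       "comment": "fix typo", "tags": []}
    ]
  }
}
""")

# Assumed mapping from Python types to Avro primitive type names.
AVRO_TYPES = {str: "string", int: "long", float: "double", bool: "boolean"}


def infer_fields(record: dict) -> list:
    """Derive one Avro-style field entry per key of a sample record."""
    fields = []
    for name, value in record.items():
        if isinstance(value, list):
            avro_type = {"type": "array", "items": "string"}
        else:
            avro_type = AVRO_TYPES.get(type(value), "string")
        fields.append({"name": name, "type": avro_type})
    return fields


def infer_schema(response: dict, path=("query", "recentchanges")) -> dict:
    """Walk down to the nested array, then infer a schema from its first record."""
    node = response
    for key in path:
        node = node[key]
    return {"type": "record", "name": "recent_change", "fields": infer_fields(node[0])}


schema = infer_schema(SAMPLE_RESPONSE)
print(json.dumps(schema, indent=2))
```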

  5. Configure the Convert JSON to AVRO processor.

    Properties

    Record Reader
    Select the JSON_Reader_Recent_Changes controller service you created earlier from the dropdown list.
    Record Writer
    Select the AvroWriter_Recent_Changes controller service you created earlier from the dropdown list.
    Relationships

    Select the following relationships:

    • Failure - Terminate, Retry

  6. Click the Apply button.
  7. Connect the Get Recent Wikipedia Changes and Convert JSON to AVRO processors: hover over the lower-right corner of the Get Recent Wikipedia Changes processor, click the arrow that appears, and drag it to the Convert JSON to AVRO processor.


  8. In the configuration popup, select the Response relationship and click the Add button.




  9. Add a QueryRecord processor.
    1. Drag a Processor from the Components sidebar to the canvas.
    2. In the Search field, filter for QueryRecord.
    3. Change the Processor Name to Filter Edits.
    4. Click the Add button.

    This processor keeps only actual page edits and drops everything else. To achieve this, it runs a query that selects all records (events) of the edit type from each FlowFile.

  10. Configure the Filter Edits processor.
    Properties
    Record Reader
    Select the AvroReader_Recent_Changes controller service you have created from the dropdown list.
    Record Writer
    Select the AvroWriter_Recent_Changes controller service you have created from the dropdown list.
    Relationships

    Select the following relationships:

    • Failure - Terminate
    • Original - Terminate

  11. You must also add a user-defined property to the Filter Edits processor. Click the Add Property button.
    1. Provide the Name as Filtered edits.
    2. Provide the Value as Select * from FLOWFILE where type='edit'.
    3. Click the Apply button.
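    To preview what the Filtered edits query returns, you can run the same statement against a few sample rows. The sketch below is only an approximation: SQLite stands in for the SQL engine that the processor uses, the table columns are a subset assumed for illustration, and the rows are made up.

```python
import sqlite3

# Made-up events: (type, title, oldlen, newlen).
RECORDS = [
    ("edit", "Page A", 100, 150),
    ("new", "Page B", 0, 300),
    ("log", "Page C", 0, 0),
    ("edit", "Page D", 500, 420),
]


def filter_edits(records):
    """Load records into a table named FLOWFILE and run the Filtered edits query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE FLOWFILE (type TEXT, title TEXT, oldlen INTEGER, newlen INTEGER)"
    )
    conn.executemany("INSERT INTO FLOWFILE VALUES (?, ?, ?, ?)", records)
    # Same statement as the value of the Filtered edits property.
    rows = conn.execute("Select * from FLOWFILE where type='edit'").fetchall()
    conn.close()
    return rows


edits = filter_edits(RECORDS)
for row in edits:
    print(row)
```

    Only the two rows of type edit come back; the name of the property (Filtered edits) becomes the relationship that carries them onward.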
  12. Connect the Convert JSON to AVRO and Filter Edits processors: hover over the lower-right corner of the Convert JSON to AVRO processor, click the arrow that appears, and drag it to the Filter Edits processor.
  13. In the configuration pane, select the Success and Failure relationships and click the Add button.
  14. Add a second QueryRecord processor.
    1. Drag a Processor from the Components sidebar to the canvas.
    2. In the Search field, filter for QueryRecord.
    3. Change the Processor Name to Route on Content Size.
    4. Click the Add button.

    This processor uses two SQL statements to separate edit events that resulted in a longer article from edit events that resulted in a shorter article.

  15. Configure the Route on Content Size processor.
    Properties
    Record Reader
    Select the AvroReader_Recent_Changes controller service you have created from the dropdown list.
    Record Writer
    Select the AvroWriter_Recent_Changes controller service you have created from the dropdown list.
    Relationships

    Select the following relationships:

    • Failure - Terminate, Retry

    • Original - Terminate

  16. You must also add two user-defined properties to the Route on Content Size processor.
    1. Click the Add Property button.
    2. Provide the Name as Added content.
    3. Provide the Value as Select * from FLOWFILE where newlen>=oldlen.
    4. Click the Add button.
    5. Click the Add Property button to create the second property.
    6. Provide the Name as Removed content.
    7. Provide the Value as Select * from FLOWFILE where newlen<oldlen.
    8. Click the Add button.
    9. Click the Apply button.
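    The two properties on the Route on Content Size processor act as two routes, each fed by its own query. As a hedged illustration, the sketch below runs both statements over made-up edit rows; SQLite again stands in for the processor's SQL engine, and the three-column table is an assumption.

```python
import sqlite3

# Property name -> query, exactly as entered on the processor.
QUERIES = {
    "Added content": "Select * from FLOWFILE where newlen>=oldlen",
    "Removed content": "Select * from FLOWFILE where newlen<oldlen",
}

# Made-up edit events: (title, oldlen, newlen).
EDITS = [("Page A", 100, 150), ("Page D", 500, 420), ("Page E", 80, 80)]


def route(edits):
    """Return the titles that each query selects, keyed by property name."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE FLOWFILE (title TEXT, oldlen INTEGER, newlen INTEGER)")
    conn.executemany("INSERT INTO FLOWFILE VALUES (?, ?, ?)", edits)
    routed = {
        name: [row[0] for row in conn.execute(sql)] for name, sql in QUERIES.items()
    }
    conn.close()
    return routed


routed = route(EDITS)
for name, titles in routed.items():
    print(name, titles)
```

    Note that an edit that leaves the length unchanged (Page E here) satisfies newlen>=oldlen, so it lands under Added content.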
  17. Connect the Filter Edits and Route on Content Size processors: hover over the lower-right corner of the Filter Edits processor, click the arrow that appears, and drag it to the Route on Content Size processor.
    In the Create Connection pop-up, select the Filtered edits relationship and click the Add button.
  18. Add two MergeRecord processors.
    1. Drag a Processor from the Components sidebar to the canvas.
    2. In the Search field, filter for MergeRecord.
    3. Change the Processor Name to Merge Edit Events.
    4. Click the Add button.
    5. Repeat steps 1 to 4 to add another identical processor.

    These processors are configured to merge at least 100 records into one FlowFile to avoid writing lots of small files. The Max Bin Age property is set to 2 minutes, which makes the processors merge records after two minutes even if fewer than 100 records have arrived.
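    The interplay between the 100-record threshold and Max Bin Age can be pictured with a small Python sketch. This is a simplified model, not the processor's implementation: the class name MergeBin and its methods are hypothetical, and time is passed in explicitly so the behavior is easy to follow.

```python
class MergeBin:
    """Collects records and releases them when the bin is full or too old."""

    def __init__(self, min_records=100, max_bin_age_s=120.0):
        self.min_records = min_records
        self.max_bin_age_s = max_bin_age_s
        self.records = []
        self.opened_at = None  # time at which the first record entered the bin

    def add(self, record, now):
        if not self.records:
            self.opened_at = now
        self.records.append(record)

    def poll(self, now):
        """Return the merged batch if the bin is ready, otherwise None."""
        if not self.records:
            return None
        full = len(self.records) >= self.min_records
        expired = now - self.opened_at >= self.max_bin_age_s
        if full or expired:
            merged, self.records = self.records, []
            self.opened_at = None
            return merged
        return None


merge_bin = MergeBin()

# Case 1: 100 records arrive quickly, so the bin is released because it is full.
for i in range(100):
    merge_bin.add(i, now=0.0)
full_batch = merge_bin.poll(now=1.0)

# Case 2: only 3 records arrive, so the bin is released once it is 2 minutes old.
for i in range(3):
    merge_bin.add(i, now=10.0)
too_early = merge_bin.poll(now=60.0)
aged_batch = merge_bin.poll(now=130.0)

print(len(full_batch), too_early, len(aged_batch))  # 100 None 3
```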

  19. Configure the two Merge Edit Events processors.

    Properties

    Record Reader
    Select the AvroReader_Recent_Changes controller service you have created from the dropdown list.
    Record Writer
    Select the AvroWriter_Recent_Changes controller service you have created from the dropdown list.
    Max Bin Age
    Set to two minutes by providing a value of 2 min.
    Relationships

    Select the following relationships:

    • Failure - Terminate

    • Original - Terminate

  20. Connect the Route on Content Size processor to both of the Merge Edit Events processors.
    1. For the first Merge Edit Events processor, select the Added content value from the Relationships options and click the Add button.
    2. For the second Merge Edit Events processor, select the Removed content value from the Relationships options and click the Add button.


  21. Add two PutFile processors to the canvas.
    1. Drag a Processor from the Components sidebar to the canvas.
    2. In the Search field, filter for PutFile.
    3. Change the Processor Name to Write "Added Content" Events To File.
    4. Click the Add button.
    5. Repeat steps 1 to 4 to add another identical processor, naming this second PutFile processor Write "Removed Content" Events To File.

    These processors write the filtered, routed edit events to two different locations on the local disk. In Cloudera Data Flow, you typically do not write to local disk; instead, you replace these processors with processors that match your destination system, such as Kafka, a database, or an object store.

  22. Configure the Write "Added Content" Events To File processor.
    Properties
    Directory
    Set to /tmp/larger_edits.
    Maximum File Count
    Set to 500.
    Relationships

    Select the following relationships:

    • Failure - Terminate

    • Success - Terminate

  23. Click the Apply button.
  24. Configure the Write "Removed Content" Events To File processor.
    Properties
    Directory
    Set to /tmp/smaller_edits.
    Maximum File Count
    Set to 500.
    Relationships

    Select the following relationships:

    • Failure - Terminate

    • Success - Terminate

  25. Click the Apply button.
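    The Directory and Maximum File Count settings on the two PutFile processors can be thought of as "write here, but stop once the directory holds this many files". The sketch below models that idea under simplifying assumptions: the function put_file is hypothetical, a temporary directory stands in for /tmp/larger_edits, and the limit is lowered to 2 so the effect is visible.

```python
import tempfile
from pathlib import Path


def put_file(directory: Path, name: str, content: bytes, max_file_count: int = 500) -> str:
    """Write content into directory unless it already holds max_file_count files."""
    directory.mkdir(parents=True, exist_ok=True)
    existing = sum(1 for path in directory.iterdir() if path.is_file())
    if existing >= max_file_count:
        return "failure"  # modeled as the Failure relationship
    (directory / name).write_bytes(content)
    return "success"  # modeled as the Success relationship


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "larger_edits"
    results = [
        put_file(target, f"batch-{i}.avro", b"merged records", max_file_count=2)
        for i in range(3)
    ]

print(results)  # ['success', 'success', 'failure']
```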
  26. Connect the Merge Edit Events processor that receives the Added content connection to the Write "Added Content" Events To File processor.
    In the Create Connection modal, select merged and click the Add button.
  27. Connect the Merge Edit Events processor that receives the Removed content connection to the Write "Removed Content" Events To File processor.
    In the Create Connection modal, select merged and click the Add button.
Congratulations, you have created your first draft flow. Next, test it by launching a Test Session.