Build your draft flow

Start building your draft flow by adding components to the Canvas and configuring them.

  1. Add an InvokeHTTP processor to the canvas.
    1. Drag a Processor from the Components sidebar to the canvas.
    2. In the text box, filter for InvokeHTTP.
    3. Change the Processor Name to Get Recent Wikipedia Changes.
    4. Click Add.

    After configuration, this processor calls the Wikipedia API to retrieve the latest changes.

  2. Configure the Get Recent Wikipedia Changes processor.
    Properties
    HTTP URL
    Provide https://en.wikipedia.org/w/api.php?action=query&list=recentchanges&format=json&rcprop=user%7Ccomment%7Cparsedcomment%7Ctimestamp%7Ctitle%7Csizes%7Ctags
    Relationships

    Select the following relationships:

    • Original - Terminate
    • Failure - Terminate, Retry
    • Retry - Terminate
    • No Retry - Terminate

  3. Click Apply.
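
The HTTP URL above packs several API parameters into one string. As a rough illustration of what the processor will request, the following sketch decodes the URL with the Python standard library. It makes no network call; the helper name describe_url is hypothetical.

```python
from urllib.parse import urlsplit, parse_qs

# The HTTP URL value from the processor configuration, split for readability.
URL = (
    "https://en.wikipedia.org/w/api.php?action=query&list=recentchanges"
    "&format=json&rcprop=user%7Ccomment%7Cparsedcomment%7Ctimestamp"
    "%7Ctitle%7Csizes%7Ctags"
)


def describe_url(url):
    """Split the URL into its endpoint and decoded query parameters."""
    parts = urlsplit(url)
    # parse_qs percent-decodes values, so %7C becomes the '|' separator.
    params = {key: values[0] for key, values in parse_qs(parts.query).items()}
    # rcprop is a '|'-separated list of properties to return per change.
    params["rcprop"] = params["rcprop"].split("|")
    endpoint = f"{parts.scheme}://{parts.netloc}{parts.path}"
    return endpoint, params


endpoint, params = describe_url(URL)
print(endpoint)
for name, value in params.items():
    print(f"{name}: {value}")
```

The sizes entry in rcprop is what makes the old and new page lengths available to the size-based routing later in the flow.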
  4. Add a ConvertRecord processor to the canvas.
    1. Drag a Processor from the Components sidebar to the canvas.
    2. In the text box, filter for ConvertRecord.
    3. Change the Processor Name to Convert JSON to AVRO.
    4. Click Add.

    This processor converts the JSON response to AVRO format. It uses RecordReaders and RecordWriters to accomplish this. It infers the JSON schema starting from the recent changes field.
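
To make the schema inference step more concrete, here is a minimal Python sketch of the idea, not the processor's actual implementation. It assumes the change events sit under query.recentchanges in the API response. The sample response is hypothetical and heavily shortened, and the type mapping is a simplification.

```python
import json

# Hypothetical, shortened API response; all values are made up.
SAMPLE_RESPONSE = """
{
  "batchcomplete": "",
  "query": {
    "recentchanges": [
      {"type": "edit", "title": "Example page", "user": "ExampleUser",
       "oldlen": 1200, "newlen": 1350, "tags": []},
      {"type": "new", "title": "Another page", "user": "OtherUser",
       "oldlen": 0, "newlen": 400, "tags": ["visualeditor"]}
    ]
  }
}
"""

# Simplified mapping from Python types to Avro-style type names (assumption).
AVRO_TYPES = {str: "string", int: "long", float: "double", bool: "boolean"}


def infer_fields(records):
    """Map each field name to the type name of its first observed value."""
    fields = {}
    for record in records:
        for name, value in record.items():
            if isinstance(value, list):
                inferred = "array"
            else:
                inferred = AVRO_TYPES.get(type(value), "string")
            fields.setdefault(name, inferred)
    return fields


# Start from the recent changes field, ignoring the response envelope.
records = json.loads(SAMPLE_RESPONSE)["query"]["recentchanges"]
schema = infer_fields(records)
print(schema)
```

Starting from the nested field means each change event becomes one record, instead of the whole response being treated as a single record.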

  5. Configure the Convert JSON to AVRO processor.

    Properties

    Record Reader
    Select the JSON_Reader_Recent_Changes controller service you have created from the drop-down list.
    Record Writer
    Select the AvroWriter_Recent_Changes controller service you have created from the drop-down list.
    Relationships

    Select the following relationship:

    • failure - Terminate, Retry

  6. Click Apply.
  7. Connect the Get Recent Wikipedia Changes and Convert JSON to AVRO processors by hovering over the lower-right corner of the Get Recent Wikipedia Changes processor, clicking the arrow that appears, and dragging it to the Convert JSON to AVRO processor.
  8. In the configuration pop-up, select the Response relationship and click Add.
  9. Add a QueryRecord processor.
    1. Drag a Processor from the Components sidebar to the canvas.
    2. In the text box, filter for QueryRecord.
    3. Name it Filter Edits.
    4. Click Add.

    This processor filters out everything except actual page edits. To achieve this, it runs a query that selects all events of the type edit from each FlowFile.

  10. Configure the Filter Edits processor.
    Properties
    Record Reader
    Select the AvroReader_Recent_Changes controller service you have created from the drop-down list.
    Record Writer
    Select the AvroWriter_Recent_Changes controller service you have created from the drop-down list.
    Relationships

    Select the following relationships:

    • failure - Terminate
    • original - Terminate

  11. For the Filter Edits processor, you also need to add a user-defined property. Click Add Property.
    1. Provide Filtered edits as Name.
    2. Provide Select * from FLOWFILE where type='edit' as Value.
    3. Click Apply.
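
To see what the Filtered edits query selects, you can try the same statement against a few sample rows. The sketch below uses an in-memory SQLite table as a stand-in; NiFi evaluates the statement with its own SQL engine, so treat this only as an illustration of the query logic. The sample events and the helper name filter_edits are hypothetical.

```python
import sqlite3

# Hypothetical sample events, reduced to two fields.
EVENTS = [
    ("edit", "Example page"),
    ("new", "Another page"),
    ("log", "Third page"),
    ("edit", "Fourth page"),
]


def filter_edits(events):
    """Return only the rows whose type is 'edit'."""
    conn = sqlite3.connect(":memory:")
    # The table is named FLOWFILE to mirror the name used in the property value.
    conn.execute("CREATE TABLE FLOWFILE (type TEXT, title TEXT)")
    conn.executemany("INSERT INTO FLOWFILE VALUES (?, ?)", events)
    rows = conn.execute("SELECT * FROM FLOWFILE WHERE type='edit'").fetchall()
    conn.close()
    return rows


filtered_edits = filter_edits(EVENTS)
print(filtered_edits)
```

The property name matters as well as the query: it becomes the name of the relationship that carries the matching records.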
  12. Connect the Convert JSON to AVRO and Filter Edits processors by hovering over the lower-right corner of the Convert JSON to AVRO processor, clicking the arrow that appears and dragging it to the Filter Edits processor.
  13. In the configuration pane, select the success and failure relationships and click Add.
  14. Add a second QueryRecord processor.
    1. Drag a Processor from the Components sidebar to the canvas.
    2. In the text box, filter for QueryRecord.
    3. Name it Route on Content Size.
    4. Click Add.

    This processor uses two SQL statements to separate edit events that resulted in a longer article from edit events that resulted in a shorter article.

  15. Configure the Route on Content Size processor.
    Properties
    Record Reader
    Select the AvroReader_Recent_Changes controller service you have created from the drop-down list.
    Record Writer
    Select the AvroWriter_Recent_Changes controller service you have created from the drop-down list.
    Relationships

    Select the following relationships:

    • failure - Terminate, Retry

    • original - Terminate

  16. For the Route on Content Size processor, you also need to add two user-defined properties.
    1. Click Add Property.
    2. Provide Added content as Name.
    3. Provide Select * from FLOWFILE where newlen>=oldlen as Value.
    4. Click Add.
    5. Click Add Property to create the second property.
    6. Provide Removed content as Name.
    7. Provide Select * from FLOWFILE where newlen<oldlen as Value.
    8. Click Add.
    9. Click Apply.
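
The two statements split the edit events into complementary groups. As a quick check of that logic, the sketch below runs both statements against an in-memory SQLite table. SQLite is only a stand-in for the processor's own SQL engine, and the sample events and the helper name route_on_content_size are hypothetical.

```python
import sqlite3

# Hypothetical sample events: (title, oldlen, newlen).
EVENTS = [
    ("Example page", 1200, 1350),
    ("Another page", 900, 900),
    ("Third page", 500, 420),
]

# Property name -> SQL statement, as entered in the processor configuration.
QUERIES = {
    "Added content": "SELECT * FROM FLOWFILE WHERE newlen>=oldlen",
    "Removed content": "SELECT * FROM FLOWFILE WHERE newlen<oldlen",
}


def route_on_content_size(events):
    """Run each statement and collect the matching rows per property name."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE FLOWFILE (title TEXT, oldlen INTEGER, newlen INTEGER)"
    )
    conn.executemany("INSERT INTO FLOWFILE VALUES (?, ?, ?)", events)
    routed = {name: conn.execute(sql).fetchall() for name, sql in QUERIES.items()}
    conn.close()
    return routed


routed = route_on_content_size(EVENTS)
for name, rows in routed.items():
    print(name, [row[0] for row in rows])
```

Because the first condition uses >=, an edit that leaves the page length unchanged is routed to Added content, and every event lands in exactly one of the two groups.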
  17. Connect the Filter Edits and Route on Content Size processors by hovering over the lower-right corner of the Filter Edits processor, clicking the arrow that appears, and dragging it to the Route on Content Size processor.
    In the Create Connection pop-up, select the Filtered edits relationship and click Add.
  18. Add two MergeRecord processors.
    1. Drag a Processor from the Components sidebar to the canvas.
    2. In the text box, filter for MergeRecord.
    3. Name it Merge Edit Events.
    4. Click Add.
    5. Repeat the above steps to add another identical processor.

    These processors are configured to merge at least 100 records into one FlowFile to avoid writing lots of small files. The Max Bin Age property is set to 2 minutes, which makes the processors merge records after two minutes even if fewer than 100 records have arrived.
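
The interplay between the record count and the bin age can be sketched in a few lines of Python. This is a simplified model, not the processor's implementation: the class name RecordBin and its parameters are hypothetical, the clock is passed in explicitly to keep the example deterministic, and the real processor has further settings that are ignored here.

```python
class RecordBin:
    """Collects records and releases them once a size or age threshold is met."""

    def __init__(self, min_records=100, max_bin_age_s=120.0):
        self.min_records = min_records
        self.max_bin_age_s = max_bin_age_s
        self.records = []
        self.opened_at = None

    def add(self, record, now):
        # The bin's age is measured from the arrival of its first record.
        if not self.records:
            self.opened_at = now
        self.records.append(record)

    def poll(self, now):
        """Return the merged batch if a threshold is met, else None."""
        if not self.records:
            return None
        full = len(self.records) >= self.min_records
        expired = now - self.opened_at >= self.max_bin_age_s
        if full or expired:
            merged, self.records = self.records, []
            self.opened_at = None
            return merged
        return None


record_bin = RecordBin()
for i in range(100):
    record_bin.add({"id": i}, now=0.0)
first = record_bin.poll(now=1.0)      # released: 100 records reached

record_bin.add({"id": 100}, now=10.0)
early = record_bin.poll(now=60.0)     # not released: 1 record, 50 s old
late = record_bin.poll(now=130.0)     # released: bin is 120 s old

print(len(first), early, len(late))   # prints: 100 None 1
```

The age limit bounds how long a slow trickle of events waits before being written, at the cost of occasionally producing a small file.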

  19. Configure the two Merge Edit Events processors.

    Properties

    Record Reader
    Select the AvroReader_Recent_Changes controller service you have created from the drop-down list.
    Record Writer
    Select the AvroWriter_Recent_Changes controller service you have created from the drop-down list.
    Max Bin Age
    Set to two minutes by providing a value of 2 min.
    Relationships

    Select the following relationships:

    • failure - Terminate

    • original - Terminate

  20. Connect the Route on Content Size processor to both of the Merge Edit Events processors.
    1. For the first Merge Edit Events processor, select Added content from Relationships and click Add.
    2. For the second Merge Edit Events processor, select Removed content from Relationships and click Add.


  21. Add two PutFile processors to the canvas.
    1. Drag a Processor from the Components sidebar to the canvas.
    2. In the text box, filter for PutFile.
    3. Name it Write "Added Content" Events To File.
    4. Click Add.
    5. Repeat the above steps to add another identical processor, naming this second PutFile processor Write "Removed Content" Events To File.

    These processors write the filtered, routed edit events to two different locations on the local disk. In Cloudera DataFlow, you would typically not write to local disk but replace these processors with processors that match your destination system (Kafka, a database, an object store, and so on).

  22. Configure the Write "Added Content" Events To File processor.
    Properties:
    Directory
    Provide /tmp/larger_edits
    Maximum File Count
    Provide 500
    Relationships

    Select the following relationships:

    • success - Terminate

    • failure - Terminate

  23. Click Apply.
  24. Configure the Write "Removed Content" Events To File processor.
    Properties:
    Directory
    Provide /tmp/smaller_edits
    Maximum File Count
    Provide 500
    Relationships

    Select the following relationships:

    • success - Terminate

    • failure - Terminate

  25. Click Apply.
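
The Directory and Maximum File Count settings can be pictured with the following Python sketch. It rests on an assumption: that a write is refused and routed to failure once the directory already holds the maximum number of files. The function name put_file is hypothetical, and the example writes to a temporary directory with a limit of 3 instead of /tmp/larger_edits and 500.

```python
import tempfile
from pathlib import Path


def put_file(directory, name, payload, max_file_count=500):
    """Write payload to directory/name and return 'success' or 'failure'."""
    target_dir = Path(directory)
    target_dir.mkdir(parents=True, exist_ok=True)
    existing = sum(1 for path in target_dir.iterdir() if path.is_file())
    # Assumed behavior: refuse the write when the directory is already full.
    if existing >= max_file_count:
        return "failure"
    (target_dir / name).write_bytes(payload)
    return "success"


with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "larger_edits"
    results = [
        put_file(out_dir, f"batch-{i}.avro", b"demo", max_file_count=3)
        for i in range(5)
    ]

print(results)
```

Under that assumption, the limit acts as a safety cap that keeps a long-running flow from filling the local disk.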
  26. Connect the Merge Edit Events processor with the Added content connection to the Write "Added Content" Events To File processor.
    In the Create Connection pop-up, select merged and click Add.
  27. Connect the Merge Edit Events processor with the Removed content connection to the Write "Removed Content" Events To File processor.
    In the Create Connection pop-up, select merged and click Add.
Congratulations, you have created your first draft flow. Next, test it by launching a Test Session.