Build your draft flow
Begin creating your draft flow by adding components to the canvas and setting them up.
- Add an InvokeHTTP processor to the canvas.
Once configured, this processor calls the Wikipedia API to fetch the latest changes.
- Drag a Processor from the Components sidebar to the canvas.
- In the Search field, filter for InvokeHTTP.
- Rename the InvokeHTTP processor by changing the Processor Name to Get Recent Wikipedia Changes.
- Click the Add button.
- Configure the Get Recent Wikipedia Changes processor.
Properties
- HTTP URL
- Enter the following URL:
https://en.wikipedia.org/w/api.php?action=query&list=recentchanges&format=json&rcprop=user%7Ccomment%7Cparsedcomment%7Ctimestamp%7Ctitle%7Csizes%7Ctags
Relationships
Select the following relationships:
- Original - Terminate
- Failure - Terminate, Retry
- Retry - Terminate
- No Retry - Terminate
- Click the Apply button.
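The HTTP URL value is easier to review when it is broken into its query parameters. The following is a minimal sketch, not part of the flow itself, that rebuilds the same URL in Python; the names `API_ENDPOINT`, `RC_PROPS`, and `build_recent_changes_url` are illustrative. It shows that `%7C` in the URL is simply an encoded `|`, which separates the requested `rcprop` fields.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"

# The rcprop fields requested by the tutorial URL, in the same order.
RC_PROPS = ["user", "comment", "parsedcomment", "timestamp", "title", "sizes", "tags"]


def build_recent_changes_url() -> str:
    """Assemble the HTTP URL value from its individual query parameters."""
    params = {
        "action": "query",
        "list": "recentchanges",
        "format": "json",
        # Multiple values are joined with "|"; urlencode escapes it as %7C.
        "rcprop": "|".join(RC_PROPS),
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


url = build_recent_changes_url()
print(url)

# Decode the query string again to list the properties that will be requested.
requested = parse_qs(urlsplit(url).query)["rcprop"][0].split("|")
print(requested)
```

To request fewer fields, you would shorten the `rcprop` list and paste the resulting URL into the HTTP URL property.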
- Add a ConvertRecord processor to the canvas.
- Drag a Processor from the Components sidebar to the canvas.
- In the Search field, filter for ConvertRecord.
- Change the Processor Name to Convert JSON to AVRO.
- Click the Add button.
This processor converts the JSON response to Avro format by using a Record Reader and a Record Writer. It infers the JSON schema starting from the recent changes field.
- Configure the Convert JSON to AVRO processor.
Properties
- Record Reader
- Select the JSON_Reader_Recent_Changes controller service you created earlier from the dropdown list.
- Record Writer
- Select the AvroWriter_Recent_Changes controller service you created earlier from the dropdown list.
Relationships
Select the following relationships:
- Failure - Terminate, Retry
- Click the Apply button.
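To illustrate what "inferring the schema starting from the recent changes field" means, here is a rough Python sketch. It assumes the API response nests the change events under `query.recentchanges`; the sample values are made up, and `extract_records` and `infer_schema` are hypothetical helpers, not what the record reader actually runs.

```python
import json

# Hypothetical, shortened API response; the values are made up for illustration.
SAMPLE_RESPONSE = json.dumps({
    "batchcomplete": "",
    "query": {
        "recentchanges": [
            {"type": "edit", "title": "Example page", "user": "ExampleUser",
             "oldlen": 1200, "newlen": 1350, "tags": []},
            {"type": "new", "title": "Another page", "user": "OtherUser",
             "oldlen": 0, "newlen": 640, "tags": ["mobile edit"]},
        ]
    },
})


def extract_records(response_text: str) -> list:
    """Start reading at the nested recentchanges array, ignoring the envelope."""
    return json.loads(response_text)["query"]["recentchanges"]


def infer_schema(records: list) -> dict:
    """Map each field name to the Python type name first seen for it."""
    schema = {}
    for record in records:
        for name, value in record.items():
            schema.setdefault(name, type(value).__name__)
    return schema


records = extract_records(SAMPLE_RESPONSE)
schema = infer_schema(records)
print(schema)
```

Each element of the array becomes one record, which is why later processors can query fields such as `type`, `oldlen`, and `newlen` directly.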
- Connect the Get Recent Wikipedia Changes and Convert JSON to AVRO processors by hovering over the lower-right corner of the Get Recent Wikipedia Changes processor, clicking the arrow that appears, and dragging it to the Convert JSON to AVRO processor.
- In the configuration popup, select the Response relationship and click the Add button.
- Add a QueryRecord processor.
- Drag a Processor from the Components sidebar to the canvas.
- In the Search field, search for QueryRecord.
- Change the processor Name to Filter Edits.
- Click the Add button.
This processor filters out anything except actual page edits. To achieve this, the processor runs a query that selects all FlowFiles (events) of the edit type.
- Configure the Filter Edits processor.
Properties
- Record Reader
- Select the AvroReader_Recent_Changes controller service you have created from the dropdown list.
- Record Writer
- Select the AvroWriter_Recent_Changes controller service you have created from the dropdown list.
Relationships
Select the following relationships:
- Failure - Terminate
- Original - Terminate
- For the Filter Edits processor, you must also add a user-defined property.
- Click the Add Property button.
- Provide the Name as Filtered edits.
- Provide the Value as Select * from FLOWFILE where type='edit'.
- Click the Apply button.
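The effect of the Filtered edits query can be tried outside the flow. The sketch below runs the same statement against an in-memory SQLite table named FLOWFILE. SQLite is only a stand-in here for the SQL engine that QueryRecord uses, and the sample records are invented.

```python
import sqlite3

# Hypothetical records standing in for the contents of one FlowFile:
# (type, title, oldlen, newlen)
RECORDS = [
    ("edit", "Example page", 1200, 1350),
    ("new", "Another page", 0, 640),
    ("edit", "Third page", 900, 850),
    ("log", "Fourth page", 0, 0),
]

conn = sqlite3.connect(":memory:")
# The incoming records are exposed to the query as a table named FLOWFILE.
conn.execute("CREATE TABLE FLOWFILE (type TEXT, title TEXT, oldlen INTEGER, newlen INTEGER)")
conn.executemany("INSERT INTO FLOWFILE VALUES (?, ?, ?, ?)", RECORDS)

# The value of the Filtered edits property from the steps above.
FILTERED_EDITS = "Select * from FLOWFILE where type='edit'"

filtered = conn.execute(FILTERED_EDITS).fetchall()
conn.close()

for row in filtered:
    print(row)
```

Only the two rows whose type is edit are returned; in the flow, the matching records are sent to the relationship named after the property, Filtered edits.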
- Connect the Convert JSON to AVRO and Filter Edits processors by hovering over the lower-right corner of the Convert JSON to AVRO processor, clicking the arrow that appears and dragging it to the Filter Edits processor.
- In the configuration pane, select the Success and Failure relationships and click the Add button.
- Add a second QueryRecord processor.
- Drag a Processor from the Components sidebar to the canvas.
- In the Search field, filter for QueryRecord.
- Change the processor Name to Route on Content Size.
- Click the Add button.
This processor uses two SQL statements to separate edit events that resulted in a longer article from edit events that resulted in a shorter article.
- Configure the Route on Content Size processor.
Properties
- Record Reader
- Select the AvroReader_Recent_Changes controller service you have created from the dropdown list.
- Record Writer
- Select the AvroWriter_Recent_Changes controller service you have created from the dropdown list.
Relationships
Select the following relationships:
- Failure - Terminate, Retry
- Original - Terminate
- For the Route on Content Size processor, you must also add two user-defined properties.
- Click the Add Property button.
- Provide the Name as Added content.
- Provide the Value as Select * from FLOWFILE where newlen>=oldlen.
- Click the Add button.
- Click the Add Property button to create the second property.
- Provide the Name as Removed content.
- Provide the Value as Select * from FLOWFILE where newlen<oldlen.
- Click the Add button.
- Click the Apply button.
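The two user-defined properties act as named routes: each property name becomes a relationship, and its SQL value decides which records go there. A small sketch of that idea follows, again using SQLite as a stand-in for the processor's SQL engine and invented edit events; `ROUTING_QUERIES` and `route` are illustrative names.

```python
import sqlite3

# Property name -> SQL value, as entered in the steps above.
ROUTING_QUERIES = {
    "Added content": "Select * from FLOWFILE where newlen>=oldlen",
    "Removed content": "Select * from FLOWFILE where newlen<oldlen",
}

# Hypothetical edit events: (title, oldlen, newlen).
EDITS = [("Page A", 1200, 1350), ("Page B", 900, 850), ("Page C", 400, 400)]


def route(edits):
    """Run every routing query and collect the matching titles per relationship."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE FLOWFILE (title TEXT, oldlen INTEGER, newlen INTEGER)")
    conn.executemany("INSERT INTO FLOWFILE VALUES (?, ?, ?)", edits)
    routed = {}
    for relationship, query in ROUTING_QUERIES.items():
        routed[relationship] = sorted(row[0] for row in conn.execute(query))
    conn.close()
    return routed


routed = route(EDITS)
print(routed)
```

Because the first condition uses `>=`, an edit that leaves the article length unchanged (Page C in the sample) is routed to Added content, and every edit lands in exactly one of the two routes.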
- Connect the Filter Edits and Route on Content Size processors by hovering over the lower-right corner of the Filter Edits processor, clicking the arrow that appears, and dragging it to the Route on Content Size processor.
In the Create Connection pop-up, select the Filtered edits relationship and click the Add button.
- Add two MergeRecord processors.
- Drag a Processor from the Components sidebar to the canvas.
- In the Search field, filter for MergeRecord.
- Change the processor Name to Merge Edit Events.
- Click the Add button.
- Repeat the previous four steps to add another identical processor.
These processors are configured to merge at least 100 records into one FlowFile to avoid writing lots of small files. The Max Bin Age property is set to 2 minutes, which makes the processors merge records after two minutes even if fewer than 100 records have arrived.
- Configure the two Merge Edit Events processors.
Properties
- Record Reader
- Select the AvroReader_Recent_Changes controller service you have created from the dropdown list.
- Record Writer
- Select the AvroWriter_Recent_Changes controller service you have created from the dropdown list.
- Max Bin Age
- Set to two minutes by providing a value of 2 min.
Relationships
Select the following relationships:
- Failure - Terminate
- Original - Terminate
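The interplay between the 100-record minimum and the 2-minute Max Bin Age can be sketched as a simple bin that flushes on whichever condition is met first. This is a simplified model for illustration only, not the MergeRecord implementation; the `MergeBin` class and its method names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MergeBin:
    """Simplified stand-in for a single merge bin."""
    min_records: int = 100        # record threshold described in the text
    max_bin_age_s: float = 120.0  # Max Bin Age of 2 min, in seconds
    records: list = field(default_factory=list)
    opened_at: Optional[float] = None

    def add(self, record, now: float) -> Optional[list]:
        """Add a record; return a merged batch once the minimum is reached."""
        if not self.records:
            self.opened_at = now  # the bin's age counts from its first record
        self.records.append(record)
        if len(self.records) >= self.min_records:
            return self._flush()
        return None

    def tick(self, now: float) -> Optional[list]:
        """Periodic check; flush a non-empty bin that has reached the maximum age."""
        if self.records and now - self.opened_at >= self.max_bin_age_s:
            return self._flush()
        return None

    def _flush(self) -> list:
        batch, self.records, self.opened_at = self.records, [], None
        return batch


merge_bin = MergeBin()
batches = []
for i in range(250):  # 250 records arriving 0.1 s apart
    batch = merge_bin.add({"id": i}, now=i * 0.1)
    if batch is not None:
        batches.append(batch)

early = merge_bin.tick(now=100.0)  # remaining bin is only 80 s old
late = merge_bin.tick(now=140.0)   # remaining bin is now 120 s old
print(len(batches), early, len(late))
```

In this run, the first 200 records leave as two full batches, and the remaining 50 are released only once the bin has aged two minutes.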
- Connect the Route on Content Size processor to both of the Merge Edit Events processors.
- For the first Merge Edit Events processor, select the Added content value from the Relationships options and click the Add button.
- For the second Merge Edit Events processor, select the Removed content value from the Relationships options and click the Add button.
- Add two PutFile processors to the canvas.
- Drag a Processor from the Components sidebar to the canvas.
- In the Search field, filter for PutFile.
- Change the processor Name to Write "Added Content" Events To File.
- Click the Add button.
- Repeat the previous four steps to add another identical processor, naming this second PutFile processor Write "Removed Content" Events To File.
These processors write the filtered, routed edit events to two different locations on the local disk. In Cloudera Data Flow, you typically do not write to local disk; instead, you replace these processors with processors that match your destination system, such as Kafka, a database, or an object store.
- Configure the Write "Added Content" Events To File processor.
Properties:
- Directory
- Set to /tmp/larger_edits.
- Maximum File Count
- Set to 500.
Relationships
Select the following relationships:
- Failure - Terminate
- Success - Terminate
- Click the Apply button.
- Configure the Write "Removed Content" Events To File processor.
Properties:
- Directory
- Set to /tmp/smaller_edits.
- Maximum File Count
- Set to 500.
Relationships
Select the following relationships:
- Failure - Terminate
- Success - Terminate
- Click the Apply button.
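The Directory and Maximum File Count settings can be pictured with a short Python sketch. It assumes that a write is rejected, rather than old files being deleted, once the directory already holds the maximum number of files; `put_file` is a hypothetical helper, and the example uses a temporary directory and a limit of 3 instead of /tmp/larger_edits and 500.

```python
import os
import tempfile


def put_file(directory: str, filename: str, content: bytes, max_file_count: int = 500) -> str:
    """Write content into directory; return "success" or "failure" like a relationship name."""
    os.makedirs(directory, exist_ok=True)
    if len(os.listdir(directory)) >= max_file_count:
        # Assumption: a full directory rejects the write instead of evicting old files.
        return "failure"
    with open(os.path.join(directory, filename), "wb") as handle:
        handle.write(content)
    return "success"


# Use a temporary directory and a small limit so the example is quick to run.
with tempfile.TemporaryDirectory() as target:
    results = [put_file(target, f"batch-{i}.avro", b"data", max_file_count=3) for i in range(5)]
    written = sorted(os.listdir(target))

print(results)
print(written)
```

Under this assumption, a downstream process would need to move or delete files from the target directory for writes to keep succeeding once the limit is reached.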
- Connect the Merge Edit Events processor with the Added content connection to the Write "Added Content" Events To File processor.
In the Create Connection modal, select merged and click the Add button.
- Connect the Merge Edit Events processor with the Removed content connection to the Write "Removed Content" Events To File processor.
In the Create Connection modal, select merged and click the Add button.
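As a quick check of the finished draft, the connections made in the steps above can be written down as a list and walked from the first processor to each PutFile processor. This is only an illustrative model of the canvas; the "(added)" and "(removed)" suffixes are hypothetical labels used here to tell the two identically named Merge Edit Events processors apart.

```python
# (source processor, relationship, destination processor), as connected in the steps above.
CONNECTIONS = [
    ("Get Recent Wikipedia Changes", "Response", "Convert JSON to AVRO"),
    ("Convert JSON to AVRO", "Success, Failure", "Filter Edits"),
    ("Filter Edits", "Filtered edits", "Route on Content Size"),
    ("Route on Content Size", "Added content", "Merge Edit Events (added)"),
    ("Route on Content Size", "Removed content", "Merge Edit Events (removed)"),
    ("Merge Edit Events (added)", "merged", 'Write "Added Content" Events To File'),
    ("Merge Edit Events (removed)", "merged", 'Write "Removed Content" Events To File'),
]


def paths_from(start, connections=CONNECTIONS):
    """Return every path from start to a processor with no outgoing connection."""
    outgoing = [(rel, dst) for src, rel, dst in connections if src == start]
    if not outgoing:
        return [[start]]
    return [
        [start, rel] + rest
        for rel, dst in outgoing
        for rest in paths_from(dst, connections)
    ]


paths = paths_from("Get Recent Wikipedia Changes")
for path in paths:
    print(" -> ".join(path))
```

The walk yields two paths that share the first three processors and split at Route on Content Size, one ending at each PutFile processor.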
