Data Buffering
An important point to keep in mind is that NiFi provides a generic data processing capability. Data can be in any format. Processors are generally scheduled with several threads. A common mistake that developers new to NiFi make is to buffer the entire contents of a FlowFile in memory. While there are cases when this is required, it should be avoided if at all possible, unless the format of the data is well known. For example, a Processor responsible for executing XPath against an XML document will need to load the entire contents of the data into memory. This is generally acceptable, as XML is not expected to be extremely large. However, a Processor that searches for a specific byte sequence may be used to search files that are hundreds of gigabytes or more. Attempting to load such a file into memory can exhaust the available heap, especially if multiple threads are processing different FlowFiles simultaneously.
Rather than buffering this data in memory, it is advisable to evaluate the data as it is streamed from the Content Repository (i.e., scan the content from the InputStream that is provided to your callback by ProcessSession.read). Of course, in this case, we don't want to read from the Content Repository for each byte, so we would use a BufferedInputStream or otherwise buffer some small amount of data, as appropriate.
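
The streaming approach described above can be sketched roughly as follows. This is a minimal illustration, not code taken from NiFi: the class name StreamingByteSearch and the method containsSequence are hypothetical, and the Knuth-Morris-Pratt matching used here is just one way to search without ever rewinding the stream. It relies only on java.io, so it can be tried outside of NiFi.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper (not part of the NiFi API): scans a stream for a byte
// sequence while holding only the read buffer and a small table in memory.
public class StreamingByteSearch {

    /** Returns true if {@code pattern} occurs anywhere in the stream. */
    public static boolean containsSequence(InputStream rawIn, byte[] pattern) throws IOException {
        if (pattern.length == 0) {
            return true; // an empty sequence trivially matches
        }

        // Knuth-Morris-Pratt failure table: failure[i] is the length of the
        // longest proper prefix of pattern[0..i] that is also a suffix of it.
        int[] failure = new int[pattern.length];
        for (int i = 1, k = 0; i < pattern.length; i++) {
            while (k > 0 && pattern[i] != pattern[k]) {
                k = failure[k - 1];
            }
            if (pattern[i] == pattern[k]) {
                k++;
            }
            failure[i] = k;
        }

        // Wrap the stream so that each single-byte read() is served from a
        // small in-memory buffer rather than from the underlying source.
        // The wrapper is deliberately not closed here; closing the stream is
        // left to whoever supplied it.
        InputStream in = new BufferedInputStream(rawIn);

        int matched = 0; // number of pattern bytes matched so far
        int next;
        while ((next = in.read()) != -1) {
            byte b = (byte) next;
            // On a mismatch, fall back using the table instead of re-reading.
            while (matched > 0 && b != pattern[matched]) {
                matched = failure[matched - 1];
            }
            if (b == pattern[matched]) {
                matched++;
            }
            if (matched == pattern.length) {
                return true; // stop early; the rest of the content is never read
            }
        }
        return false;
    }
}
```

In a Processor, a method like this would presumably be invoked from inside the callback passed to ProcessSession.read, with the InputStream handed to that callback as the first argument. Memory use then depends on the buffer size and the length of the pattern, not on the size of the FlowFile.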