Disk configuration
When planning how to size and configure Flow Management clusters, it is important to keep in mind the flow design factors that might impact your cluster sizing needs and the performance of your data flow.
For most modern systems, disk throughput is lower than network throughput, so the network is usually not the bottleneck. For most data movement use cases, the CPU is under far less pressure than disk I/O, but it is still important to monitor CPU usage and tune the number of threads per processor. See Tuning your Data Flow for recommendations about fine-tuning thread usage.
NiFi keeps three repositories on disk, so disk configuration is a very important performance factor:
- Content Repository
  - Contains the content of each FlowFile
  - Sequential disk I/O (ideally leveraging the OS cache)
- FlowFile (metadata) Repository
  - Contains the FlowFile attributes and current FlowFile state (which queue it is in) for each FlowFile
  - Sequential and random disk I/O
- Provenance (metadata) Repository
  - Contains a provenance log with entries for every action performed on a FlowFile (merge, drop, and so on)
  - Sequential and random disk I/O
Every FlowFile that NiFi receives or creates is immediately written to disk in the content repository for fault tolerance. Subsequent FlowFile content modifications (decompression, format conversion, and so on) are also written to the content repository. Processors that do not modify the content, such as RouteOnAttribute, do not impact the content repository. Instead, the FlowFile repository keeps a pointer for each FlowFile showing its state, such as which queue it is located in. This optimization eliminates the need for redundant writes to the content repository.
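The write pattern described above can be illustrated with a toy in-memory model. This is only a sketch for intuition, not NiFi's actual implementation: the `ToyNiFi` class, its method names, and the integer claim IDs are all hypothetical, and cleanup of superseded content is not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNiFi:
    """Hypothetical toy model of the content and FlowFile repositories."""
    content_repo: dict = field(default_factory=dict)   # claim id -> content bytes
    flowfile_repo: dict = field(default_factory=dict)  # FlowFile id -> pointer and state
    content_writes: int = 0                            # number of content repository writes
    _next_claim: int = 0

    def _write_content(self, data: bytes) -> int:
        # Every new or modified content is written as a new entry.
        claim = self._next_claim
        self._next_claim += 1
        self.content_repo[claim] = data
        self.content_writes += 1
        return claim

    def receive(self, ff_id: str, data: bytes, queue: str) -> None:
        # Incoming content is written to the content repository first.
        claim = self._write_content(data)
        self.flowfile_repo[ff_id] = {"claim": claim, "queue": queue}

    def modify_content(self, ff_id: str, transform, queue: str) -> None:
        # Content changes (decompression, conversion, ...) cause another content write.
        record = self.flowfile_repo[ff_id]
        record["claim"] = self._write_content(transform(self.content_repo[record["claim"]]))
        record["queue"] = queue

    def route(self, ff_id: str, queue: str) -> None:
        # Attribute-only processing updates the FlowFile record; content is untouched.
        self.flowfile_repo[ff_id]["queue"] = queue


nifi = ToyNiFi()
nifi.receive("ff-1", b"hello", queue="received")
nifi.route("ff-1", queue="matched")                          # no content write
nifi.modify_content("ff-1", bytes.upper, queue="converted")  # one more content write
print(nifi.content_writes)
```

In this model, only `receive` and `modify_content` touch the content repository, while `route` changes nothing but the FlowFile record. That mirrors why a processor such as RouteOnAttribute adds no content repository I/O.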
For higher performance, configure multiple disks for both the content and provenance repositories.
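As a minimal sketch, multiple disks can be assigned in nifi.properties by adding extra directory entries with unique suffixes. The mount points below (/disk1 through /disk5) are placeholders chosen for illustration, and the number of disks per repository is an assumption, not a recommendation.

```properties
# Content repository spread across three disks (placeholder mount points)
nifi.content.repository.directory.default=/disk1/content_repository
nifi.content.repository.directory.content1=/disk2/content_repository
nifi.content.repository.directory.content2=/disk3/content_repository

# Provenance repository spread across two disks (placeholder mount points)
nifi.provenance.repository.directory.default=/disk4/provenance_repository
nifi.provenance.repository.directory.provenance1=/disk5/provenance_repository
```

Each suffix after `directory.` only needs to be unique within its repository. Check the exact property names against the NiFi System Properties documentation for your version before applying them.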
For more information, see Configuration Best Practices, as well as File System Content Repository Properties and Write Ahead Provenance Repository Properties in the NiFi System Properties documentation.