Sizing recommendations

Properly sizing your Cloudera Flow Management clusters is crucial for optimal performance. Follow these recommendations to configure your environment effectively.

On-premises (bare metal) installations

Cloudera recommends the following disk setup for bare metal environments:

  • One RAID 1 or RAID 10 array for the operating system (OS)

  • One RAID 1 or RAID 10 array for the FlowFile repository

  • One or more RAID 1 or RAID 10 arrays for the content repository

  • One or more RAID 1 or RAID 10 arrays for the provenance repository

For high-performance setups, Cloudera recommends SSDs over spinning disks.

Cloud environments

For cloud deployments, larger disks typically offer better throughput. Review your cloud provider’s documentation for specific details and best practices.

Memory considerations

NiFi processes FlowFiles of any size efficiently because it does not load FlowFile content into memory. Instead, it reads and writes content through input and output streams (a few specific processors are exceptions). As a result, NiFi does not need much memory even when it processes very large files. Leave most of the system memory available for the OS cache, which optimizes disk reads: with a large enough cache, many disk reads are skipped completely. Unless NiFi runs very specific memory-oriented data flows, a Java heap of 8 GB or 16 GB is usually sufficient.
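The streaming idea described above can be illustrated with a minimal Python sketch. This is not NiFi code; the function name `stream_copy` and the 64 KiB buffer size are assumptions chosen for illustration. It only shows why copying through a stream needs a small, fixed buffer regardless of how large the content is.

```python
import io

CHUNK_SIZE = 64 * 1024  # hypothetical buffer size, chosen for illustration


def stream_copy(source, sink, chunk_size=CHUNK_SIZE):
    """Copy source to sink one chunk at a time and return the bytes copied."""
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:  # an empty read signals the end of the stream
            break
        sink.write(chunk)
        total += len(chunk)
    return total


# A 1 MiB in-memory payload stands in for a much larger FlowFile.
payload = b"x" * (1024 * 1024)
copied_sink = io.BytesIO()
copied = stream_copy(io.BytesIO(payload), copied_sink)
print(copied)
```

Inside `stream_copy`, only one chunk is held at a time, so the working memory depends on `chunk_size` and not on the size of the content being copied.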

Performance and scalability

The performance you can expect depends directly on the hardware and the flow design. For example, a flow that reads compressed data from a cloud object store, decompresses it, filters it based on specific values, compresses the filtered data, and sends it to a cloud object store can achieve the following results:

Nodes   Data rate per second   Events per second   Data rate per day   Events per day
1       192.5 MB               946,000             16.6 TB             81.7 billion
5       881 MB                 4.97 million        76 TB               429.4 billion
25      5.8 GB                 26 million          501 TB              2.25 trillion
100     22 GB                  90 million          1.9 PB              7.8 trillion
150     32.6 GB                141.3 million       2.75 PB             12.2 trillion

These metrics were collected on Google Kubernetes Engine (GKE) with each node configured with 32 cores, 15 GB RAM, and a 2 GB heap. The content repository used a 1 TB Persistent SSD (400 MB/s write, 1200 MB/s read).
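The relationship between the per-second and per-day columns in the table above, and how throughput scales with node count, can be checked with a short Python sketch. The per-second figures are copied from the table; the helper names and the use of decimal units (1 TB = 1,000,000 MB) are assumptions, so the computed values agree with the published, rounded figures only approximately.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# (nodes, MB per second, events per second), copied from the table above
RESULTS = [
    (1, 192.5, 946_000),
    (5, 881.0, 4_970_000),
    (25, 5_800.0, 26_000_000),
    (100, 22_000.0, 90_000_000),
    (150, 32_600.0, 141_300_000),
]


def per_day(mb_per_s, events_per_s):
    """Convert per-second rates to (TB per day, events per day)."""
    tb_per_day = mb_per_s * SECONDS_PER_DAY / 1_000_000  # assumes decimal units
    events_per_day = events_per_s * SECONDS_PER_DAY
    return tb_per_day, events_per_day


def per_node_rate(nodes, mb_per_s):
    """Average data rate handled by each node, in MB per second."""
    return mb_per_s / nodes


for nodes, mb_per_s, events_per_s in RESULTS:
    tb, events = per_day(mb_per_s, events_per_s)
    rate = per_node_rate(nodes, mb_per_s)
    print(f"{nodes:>3} nodes: {tb:8.1f} TB/day, {events:.3e} events/day, "
          f"{rate:6.1f} MB/s per node")
```

With these inputs, the per-node rate stays between roughly 176 MB/s and 232 MB/s across all five cluster sizes, which is consistent with the near-linear horizontal scaling that the table suggests.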

NiFi supports both vertical and horizontal scaling. Depending on the number of data flows running in the NiFi cluster and your operational requirements, you can add nodes to the NiFi cluster over time to meet your needs.

With this information in mind, Cloudera recommends:

  • CPU: At least 4 cores per NiFi node (8 cores are preferred for common use cases).
  • Disks: At least 6 disks per NiFi node to dedicate separate disks for repositories.
  • Memory: At least 4 GB of RAM for the NiFi heap.
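As a rough illustration, the minimums listed above can be encoded in a small Python check for a planned node. The thresholds come from the list; the `NodeSpec` class and the `check_node` function are hypothetical names introduced for this sketch and are not part of any Cloudera or NiFi tooling.

```python
from dataclasses import dataclass

# Thresholds taken from the recommendations above
MIN_CORES = 4
PREFERRED_CORES = 8
MIN_DISKS = 6
MIN_HEAP_GB = 4


@dataclass
class NodeSpec:
    """Hypothetical description of a planned NiFi node."""
    cores: int
    disks: int
    heap_gb: float


def check_node(spec):
    """Return a list of findings; an empty list means all recommendations are met."""
    findings = []
    if spec.cores < MIN_CORES:
        findings.append(f"CPU: {spec.cores} cores is below the minimum of {MIN_CORES}")
    elif spec.cores < PREFERRED_CORES:
        findings.append(f"CPU: {PREFERRED_CORES} cores are preferred for common use cases")
    if spec.disks < MIN_DISKS:
        findings.append(f"Disks: {spec.disks} disks is below the minimum of {MIN_DISKS}")
    if spec.heap_gb < MIN_HEAP_GB:
        findings.append(f"Memory: {spec.heap_gb} GB heap is below the minimum of {MIN_HEAP_GB} GB")
    return findings


for finding in check_node(NodeSpec(cores=4, disks=4, heap_gb=2)):
    print(finding)
```

A node that meets every recommendation, for example `NodeSpec(cores=8, disks=6, heap_gb=8)`, produces an empty list.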

For more details and a use case example, see Processing one billion events per second with NiFi.