Configuring out of memory recovery

You can optionally specify the step size by which memory is increased after an out of memory (OOM) event, to help your pods recover from OOM crashes. You can also specify an upper bound on the memory increase to prevent unbounded scaling.

The Cloudera Flow Management - Kubernetes Operator can detect an Out of Memory event in a NiFi cluster and scale up the memory footprint when configured for Out of Memory Recovery. This feature is responsive, not preventative: the NiFi cluster must first run out of memory and fail a Readiness check before a recovery attempt is made, which can impact Flow performance. OOM Recovery is intended as a safeguard and is not a replacement for good cluster sizing. If OOM Recovery has triggered, it is recommended that you reevaluate your NiFi resource sizing.

OOM Recovery has two fields to configure: stepSize and upperBound. stepSize defines the amount of memory that is added for each OOM event. upperBound defines the maximum amount of memory that the NiFi container is allowed to reach through OOM Recovery.

spec:
  outOfMemoryRecovery:
    stepSize: [***DEFINES THE MEMORY INCREASE EVERY TIME PODS ARE OOMKILLED***]
    upperBound: [***SPECIFIES THE UPPER LIMIT OF MEMORY INCREASE FOR MEMORY PROTECTION***]

For example:

spec:
  outOfMemoryRecovery:
    stepSize: 1Gi
    upperBound: 8Gi
  resources:
    nifi:
      requests:
        cpu: "1"
        memory: 4Gi

With the above spec, NiFi containers start with 4Gi of memory, and the memory grows by 1Gi for each OOM event until the NiFi container memory reaches 8Gi. When only memory requests are provided, the NiFi container memory request grows. If memory limits are provided, only the memory limit grows.

Note: This can break Quality of Service (QoS) for the Pod, because the memory request and the memory limit do not grow together. In the future, the requests and limits will grow proportionately.
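
As an illustration of the limits case, the sketch below sets the memory request and the memory limit to the same value. It assumes that limits are declared under resources.nifi.limits, mirroring the requests block shown earlier and the standard Kubernetes requests/limits layout; verify the field path against your operator version before using it.

```yaml
spec:
  outOfMemoryRecovery:
    stepSize: 1Gi
    upperBound: 8Gi
  resources:
    nifi:
      requests:
        cpu: "1"
        memory: 4Gi   # not grown by OOM Recovery when a memory limit is present
      limits:         # assumed field path, following the Kubernetes requests/limits convention
        cpu: "1"
        memory: 4Gi   # effective container limit expected to reach 5Gi after the first OOM event
```

When CPU and memory requests equal their limits for every container in a Pod, Kubernetes assigns the Pod the Guaranteed QoS class. Once the memory limit has grown beyond the request, the Pod no longer meets that condition and would be classed as Burstable, which is one way the QoS change described in the note can show up.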

Once OOM Recovery has taken effect, the memory is never scaled down automatically. The OOM Recovery growth is removed when a NiFi resource spec change is detected or when OOM Recovery is removed from the NiFi spec.
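
For example, if recovery has triggered and you decide to resize the cluster, an edit such as the following would count as a NiFi resource spec change. The 6Gi value is purely illustrative and is not a sizing recommendation.

```yaml
spec:
  outOfMemoryRecovery:
    stepSize: 1Gi
    upperBound: 8Gi
  resources:
    nifi:
      requests:
        cpu: "1"
        memory: 6Gi   # raised from 4Gi (illustrative); the resource spec change removes the accumulated growth
```

After such a change, the accumulated growth is removed and the containers should start from the newly specified request, with further growth applied again in stepSize increments if another OOM event occurs.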

NiFi Resource Conditions

The following status field and condition have been added to track the OOM Recovery process:

status:
  conditions:
  - lastTransitionTime: "2025-04-15T16:16:15Z"
    message: NiFi has vertically scaled for OOM recovery
    observedGeneration: 2
    reason: OOMRecoveryScaleUp
    status: "False"
    type: VerticallyScaleUp
  outOfMemoryRecoveryGrowth: 500Mi

The field outOfMemoryRecoveryGrowth tracks how much the NiFi memory has already grown. The VerticallyScaleUp condition provides the last time the cluster scaled up and indicates whether the scaling action is complete. While the status of VerticallyScaleUp is “True”, the scaling is in progress. Once the scaling action is complete, the status is set to “False”.
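
By contrast with the completed state shown in the example above, a scale-up that is still in progress would be reported along the following lines. This is a minimal sketch that shows only the two fields described here; the remaining condition fields are omitted because their values during scaling are not documented in this section.

```yaml
status:
  conditions:
  - type: VerticallyScaleUp
    status: "True"   # scaling in progress; set to "False" once the scaling action completes
```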