Storage configuration
Learn about storage configuration, available storage types, and storage configuration recommendations for Kafka in Cloudera Streams Messaging Operator for Kubernetes.
Storage for Kafka is configured in the KafkaNodePool resource using the spec.storage property. The following configuration snippet defines 100 GiB of persistent storage with the default storage class for Kafka in a KafkaNodePool resource. The deleteClaim property specifies whether the persistent volume claim is deleted when the cluster is undeployed.

#...
kind: KafkaNodePool
spec:
  storage:
    type: persistent-claim
    size: 100Gi
    deleteClaim: true

Cloudera Streams Messaging Operator for Kubernetes supports multiple types of storage depending on the platform. The supported storage types are as follows:
- Ephemeral
- Persistent
- JBOD (Just a Bunch of Disks)
The storage type is configured with the storage.type property. The property accepts three values: ephemeral, persistent-claim, and jbod. Each value corresponds to its respective storage type.
The following sections provide a more in-depth look at each storage type, and collect Cloudera recommendations on storage.
Ephemeral storage
Ephemeral storage is retained only for the lifetime of a pod and is lost when the pod is deleted. It is not suitable for production and should only be used for development or test clusters.
To use ephemeral storage, set storage.type to ephemeral.

#...
kind: KafkaNodePool
spec:
  storage:
    type: ephemeral

The available configuration options are listed in the Strimzi documentation.
Persistent storage
Persistent storage preserves data across system disruptions. Cloudera recommends that you use persistent storage for production environments. When using this configuration, a single persistent storage volume is defined.
To use persistent storage, set storage.type to
persistent-claim.
#...
kind: KafkaNodePool
spec:
  storage:
    type: persistent-claim
Custom storage classes
A custom storage class is specified with the storage.class property. The following example configures a custom storage class for the pods in the cluster that it is configured for.

#...
kind: KafkaNodePool
spec:
  storage:
    type: persistent-claim
    class: custom-storage-class
If you want to configure storage classes on a per-broker basis, deploy multiple KafkaNodePool resources, each with a different storage class.
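As a sketch, a per-broker-basis setup might look like the following two KafkaNodePool resources. The cluster name, pool names, and storage class names (fast-ssd, standard-hdd) are illustrative and depend on your environment.

```yaml
# Illustrative example: two node pools in the same Kafka cluster,
# each bound to a different storage class.
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaNodePool
metadata:
  name: pool-fast
  labels:
    strimzi.io/cluster: my-cluster
spec:
  replicas: 3
  roles:
    - broker
  storage:
    type: persistent-claim
    size: 100Gi
    class: fast-ssd
---
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaNodePool
metadata:
  name: pool-standard
  labels:
    strimzi.io/cluster: my-cluster
spec:
  replicas: 3
  roles:
    - broker
  storage:
    type: persistent-claim
    size: 100Gi
    class: standard-hdd
```

Brokers in pool-fast get volumes from one storage class, while brokers in pool-standard get volumes from the other.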
JBOD storage
Just a bunch of disks (JBOD) refers to a system configuration where disks are used independently rather than organizing them into redundant arrays. JBOD storage allows you to configure your Kafka cluster to use multiple volumes. This approach provides increased data storage capacity for Kafka nodes, and can lead to performance improvements. A JBOD configuration is defined by one or more volumes, each of which can be either ephemeral or persistent.
To use JBOD storage, set the storage.type to jbod and specify
the volumes.
The following example uses the jbod storage type with two attached persistent volumes. Each volume must be identified by a unique ID.
#...
kind: KafkaNodePool
spec:
  storage:
    type: jbod
    volumes:
      - id: 0
        type: persistent-claim
        size: 100Gi
        deleteClaim: false
      - id: 1
        type: persistent-claim
        size: 100Gi
        deleteClaim: false
You can always increase or decrease the number of disks or increase the volume size by modifying the KafkaNodePool resource and reapplying the changes. However, you cannot change the IDs once volumes are created.
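For example, a third disk could be added to the JBOD configuration above by appending a new volume entry with a previously unused ID, while the existing entries for IDs 0 and 1 remain unchanged:

```yaml
#...
kind: KafkaNodePool
spec:
  storage:
    type: jbod
    volumes:
      - id: 0
        type: persistent-claim
        size: 100Gi
        deleteClaim: false
      - id: 1
        type: persistent-claim
        size: 100Gi
        deleteClaim: false
      # New volume: uses a fresh ID; existing IDs must not be changed or reused
      - id: 2
        type: persistent-claim
        size: 100Gi
        deleteClaim: false
```

Reapplying the modified KafkaNodePool resource triggers the operator to reconcile the change and provision the new volume.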
The available configuration options are listed in the Strimzi documentation.
Storage recommendations
Cloudera recommends using persistent storage to store Kafka data. Ephemeral storage is only suitable for short-lived test clusters. Use a dynamic provisioner storage class with block storage (ReadWriteOnce access) and prefer SSD or NVMe disks.
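As an illustration, a dynamic provisioner storage class matching these recommendations might look like the following. The class name is hypothetical, and the provisioner and parameters shown are specific to the AWS EBS CSI driver with SSD-backed gp3 volumes; substitute the equivalents for your platform.

```yaml
# Illustrative StorageClass for SSD-backed block storage (AWS EBS CSI driver)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: kafka-ssd
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Retain
```

The class name can then be referenced from the storage.class property of the KafkaNodePool resource.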
Consider the following when using persistent storage.
Local storage
Using local storage makes the deployment similar to a bare-metal deployment in terms of scheduling and availability. It provides good throughput as storage operations have less overhead when replication and network hops are not necessary.
However, the Kafka pods become bound to the node where the backing volume is located. This means that the pods cannot be scheduled to a different node, which impacts availability.
Distributed storage
Using distributed storage with synchronous replication allows leveraging the flexibility of Kubernetes pod scheduling. Kafka pods can be migrated across nodes due to the availability of the same storage on different nodes. This improves the availability of the Kafka cluster. Node failures do not bring down Kafka brokers permanently.
However, distributed storage reduces throughput in the Kafka cluster. The synchronous replication of storage adds extra overhead to disk writes. Additionally, if the backing storage class does not support data locality, reads and writes require extra network hops.
