Rolling restart checks

You can configure Cloudera Manager to perform checks during a rolling restart to ensure that Kafka roles stay healthy. Rolling restart checks are available for both Kafka brokers and KRaft controllers, with each configured independently to provide different levels of health guarantees.

Rolling restart checks are enabled by default to ensure cluster health during rolling restarts. For Kafka brokers, the default check ensures that healthy partitions stay healthy during restarts. For KRaft controllers, the default check ensures that the majority of controllers remain online during restarts. These checks help prevent service disruptions that could occur if roles are restarted without verifying cluster health. For example, without checks, Cloudera Manager might restart a broker while the previous broker is not fully ready for operation, causing outages and corrupted log indexes. Similarly, for KRaft controllers, restarting without proper checks can disrupt the metadata quorum and impact cluster operations.

Broker and controller rolling restart checks are configured independently using separate properties. This allows you to disable, enable, or configure different check levels for each type based on your cluster requirements.

Rolling restart checks are available for both Kafka brokers and KRaft controllers:
  • Kafka broker checks are configured using the Cluster Health Guarantee During Rolling Restart property. These checks focus on partition health and replica synchronization to ensure that topics remain accessible and data is not lost during broker restarts.
  • KRaft controller checks are configured using the KRaft Cluster Health Guarantee During Rolling Restart property. These checks focus on maintaining metadata quorum health to ensure that the majority of controllers remain available during controller restarts. These checks are only applicable when KRaft is used as the metadata store.

Rolling restart checks for Kafka brokers

Kafka broker rolling restart checks focus on partition health and replica synchronization. There are multiple checks available, each providing a different level of guarantee on Kafka cluster and broker health. The type of check performed is configured with the Cluster Health Guarantee During Rolling Restart property.

The property has five different settings; each setting corresponds to a different type of check. The available settings and the checks that they correspond to are as follows:
none
This setting disables rolling restart checks. If this option is selected, no checks are performed and no health guarantees are provided.
ready for request
This setting ensures that when a broker is restarted, the restarted broker is accepting and responding to requests made on its service port. The next broker is only restarted after the previous broker is ready for requests.
healthy partitions stay healthy (default)
This setting ensures that no partitions go into an under-min-isr state when a broker is stopped. This is achieved by waiting before each broker is stopped so that all other brokers can catch up with all replicas that are in an at-min-isr state. Additionally, this setting ensures that the restarted broker is accepting and responding to requests made on its service port before the next broker is restarted. This setting ignores partitions that are already in an under-min-isr state.
all partitions stay healthy (recommended)
This setting ensures that no partitions are in an under-min-isr or at-min-isr state when a broker is stopped. This is achieved by waiting before each broker is stopped so that all other brokers can catch up with all replicas that are in an at-min-isr or under-min-isr state. Additionally, this setting ensures that the restarted broker is accepting requests on its service port before the next broker is restarted.
all partitions fully replicated
This setting ensures that all partitions are in a fully synchronized state when a broker is stopped. This is achieved by waiting before each broker is stopped so that all other brokers can catch up with all replicas that are out of sync. Additionally, this setting ensures that the restarted broker is accepting requests on its service port before the next broker is restarted.
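The partition states that these settings reason about can be illustrated with a short sketch. The following Python snippet is illustrative only (the function name and arguments are assumptions, not Cloudera Manager's implementation); it classifies a partition from its in-sync replica count relative to min.insync.replicas and the replication factor:

```python
def classify_partition(isr_count: int, min_isr: int, replication_factor: int) -> str:
    """Classify a partition the way the rolling restart settings describe them.

    Illustrative sketch only; the thresholds follow the definitions of
    under-min-isr, at-min-isr, and fully replicated partitions.
    """
    if isr_count < min_isr:
        return "under-min-isr"     # below min.insync.replicas: writes can fail
    if isr_count == min_isr:
        return "at-min-isr"        # losing one more replica makes it under-min-isr
    if isr_count < replication_factor:
        return "under-replicated"  # healthy for writes, but not fully synchronized
    return "fully-replicated"

# With replication factor 3 and min.insync.replicas=2:
print(classify_partition(1, 2, 3))  # under-min-isr
print(classify_partition(2, 2, 3))  # at-min-isr
print(classify_partition(3, 2, 3))  # fully-replicated
```

Under this sketch, healthy partitions stay healthy waits until no other broker holds at-min-isr replicas, all partitions stay healthy additionally waits out under-min-isr replicas, and all partitions fully replicated waits until every partition is fully-replicated.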

When Cloudera Manager executes a broker rolling restart check, it uses the kafka-topics tool to gather information about the brokers, topics, and partitions. The kafka-topics tool requires a valid client configuration file to run. Rolling restart checks require two configuration files: one for the kafka-topics commands initiated before a broker is stopped, and another for the commands initiated after a broker is restarted. Cloudera Manager automatically generates these client configuration files based on the configuration of the Kafka service. The files can also be manually updated using advanced security snippets.

Using these files, Cloudera Manager executes kafka-topics commands on the brokers. Based on the response from the tool, Cloudera Manager either waits for a specified amount of time or continues with the rolling restart.

Depending on what type of check is configured, Cloudera Manager polls information with kafka-topics at different points in time. As a result, the checks can be categorised into two groups: pre-checks and post-checks. If healthy partitions stay healthy, all partitions stay healthy, or all partitions fully replicated is selected, information is polled both before a broker is stopped (pre-check) and after a broker is restarted (post-check). If the ready for request setting is selected, information is only polled after a broker is restarted.

If a pre-check fails to verify, within the configured timeout interval, a state in which the broker can be safely stopped, the check stops the entire rolling restart process. This happens, for example, if the broker that is about to be stopped still has at-min-isr or under-min-isr partitions when the timeout interval is reached. Post-checks behave in a similar way: if the post-check does not receive confirmation (a successful exit code) from the kafka-topics command within the specified timeout interval that the broker is ready for requests, the check stops the entire rolling restart process. In both cases, the brokers are not stopped or restarted; the rolling restart fails and the brokers continue to run.
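The pre-check and post-check behaviour amounts to a poll-and-retry loop with an overall timeout. The following Python sketch shows the assumed structure (it is not Cloudera Manager's actual code); the `check` callable stands in for a kafka-topics invocation whose exit code indicates whether the required state has been reached:

```python
import time

def run_check(check, timeout_s: float, retry_interval_s: float,
              clock=time.monotonic, sleep=time.sleep) -> bool:
    """Poll `check` until it passes or the overall timeout elapses.

    Returns True if the check passed, False if the rolling restart should
    be aborted (the brokers are then left running).
    """
    deadline = clock() + timeout_s
    while True:
        if check():
            return True               # safe to stop / restart the next broker
        if clock() >= deadline:
            return False              # timeout: abort the rolling restart
        sleep(retry_interval_s)

# Example: a check that passes on its third attempt.
attempts = iter([False, False, True])
print(run_check(lambda: next(attempts), timeout_s=5, retry_interval_s=0,
                sleep=lambda s: None))  # True
```

The timeout and retry interval correspond to the fine-tuning properties described below, such as Maximum Allowed Runtime For Kafka Broker Rolling Restart Check and Retry Interval For Kafka Broker Rolling Restart Check.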

In addition to configuring and enabling these checks using Cluster Health Guarantee During Rolling Restart, a number of other configuration properties are also available that enable you to fine-tune the behaviour of the checks. For detailed steps on how to enable and configure rolling restart checks, see Configuring rolling restart checks.

Rolling restart checks for KRaft controllers

KRaft controller rolling restart checks are configured independently from Kafka broker rolling restart checks using the KRaft Cluster Health Guarantee During Rolling Restart property. KRaft controller rolling restart checks only apply when KRaft is used as the metadata store for the Kafka service.

The KRaft controller rolling restart checks ensure that the KRaft quorum remains healthy during a rolling restart. The checks verify that the majority of controllers (half + 1) remain available and operational throughout the restart process. Unlike broker checks which focus on partition health, controller checks focus on maintaining metadata quorum health.

The KRaft Cluster Health Guarantee During Rolling Restart property has two settings:
none
This setting disables KRaft controller rolling restart checks. If this option is selected, no checks are performed on controllers during a rolling restart.
majority of controllers online (default)
This setting ensures that during a rolling restart, the majority of KRaft controllers remain online and operational. This is the recommended setting to maintain metadata availability during restarts.
KRaft controller rolling restart checks use the kafka-metadata-quorum command line tool to verify quorum health. The checks are performed in two phases:
  • Pre-check (before stopping a controller): Before stopping a controller, Cloudera Manager verifies that the majority of the KRaft quorum will remain available after the controller is stopped. The check counts the number of active controllers (controllers with low lag and recent metadata fetch timestamps) and ensures that at least half + 1 controllers will remain active after stopping the current controller. If the majority requirement cannot be met, the check waits and retries until the requirement is satisfied or a timeout is reached.
  • Post-check (after starting a controller): After restarting a controller, Cloudera Manager verifies that the controller is actively participating in metadata updates. The check monitors the controller's LastFetchTimestamp and ensures it is increasing, which indicates the controller is successfully fetching and processing metadata updates from the quorum leader. The check succeeds when the timestamp increases.
A controller is considered active if it meets the following criteria:
  • Its lag (number of uncommitted metadata messages) is below the configured threshold (default: 5 messages)
  • Its LastFetchTimestamp is recent (within the configured threshold, default: 5 seconds)
  • Its status is either Leader or Follower (not Observer)
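Taken together, the activity criteria and the pre-check's majority requirement can be sketched as a simple predicate. This is an illustrative Python sketch with assumed field names (the real check parses kafka-metadata-quorum output, which this code does not do):

```python
from dataclasses import dataclass

MAX_LAG_MESSAGES = 5       # default of Maximum Allowed Lag for Active Kafka Controllers
MAX_FETCH_LAG_SECONDS = 5  # default of Maximum Allowed Lag in LastFetchTimestamp

@dataclass
class ControllerStatus:
    """Assumed representation of one controller's quorum status."""
    lag: int                 # uncommitted metadata messages behind the leader
    last_fetch_age_s: float  # seconds since the controller's LastFetchTimestamp
    status: str              # "Leader", "Follower", or "Observer"

def is_active(c: ControllerStatus) -> bool:
    """A controller is active if its lag is low, its fetch is recent,
    and it is a voting member (not an Observer)."""
    return (c.lag < MAX_LAG_MESSAGES
            and c.last_fetch_age_s <= MAX_FETCH_LAG_SECONDS
            and c.status in ("Leader", "Follower"))

def safe_to_stop(controllers: list, stopping: ControllerStatus) -> bool:
    """Pre-check: would at least half + 1 controllers stay active
    after stopping the given controller?"""
    majority = len(controllers) // 2 + 1
    active_after = sum(1 for c in controllers if c is not stopping and is_active(c))
    return active_after >= majority
```

For example, in a three-controller quorum with an up-to-date leader and two up-to-date followers, stopping one follower satisfies the pre-check; if one of the remaining controllers is lagging badly, the pre-check fails and the restart waits and retries.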
In addition to the main KRaft Cluster Health Guarantee During Rolling Restart property, several configuration properties are available to fine-tune the behaviour of KRaft controller rolling restart checks:
  • Maximum Allowed Runtime for Kafka Controller Rolling Restart Checks - Specifies the overall timeout (default: 15 minutes)
  • Retry Interval for Kafka Controller Rolling Restart Checks - Specifies the wait time between check attempts (default: 30 seconds)
  • API Timeout For kafka-metadata-quorum Client Used In Kafka Controller Rolling Restart Checks - Specifies the timeout for the kafka-metadata-quorum command (default: 60 seconds)
  • Maximum Allowed Lag for Active Kafka Controllers - Specifies the maximum lag in messages before a controller is considered inactive (default: 5 messages)
  • Maximum Allowed Lag in LastFetchTimestamp for Active Kafka Controllers - Specifies the maximum timestamp lag before a controller is considered inactive (default: 5 seconds)

Advanced configuration for broker checks

There are two scenarios when additional configuration is required for Kafka broker rolling restart checks. These scenarios are as follows:
Kafka brokers are configured to use custom listeners
If you configured your Kafka brokers with advanced configuration snippets to use custom listeners (for example a custom host:port pair), you must manually update both client configuration files that Cloudera Manager generates. This is required because Cloudera Manager might not be able to automatically extract the information required to establish a connection with the Kafka brokers when custom listeners are configured. For more information, see Configuring the client configuration used for rolling restart checks.
A broker connectivity change is made after rolling restart checks are enabled

A broker connectivity change is any change made to listeners, bootstrap servers, ports, or security. If a change like this is made after rolling restart checks are enabled, Cloudera Manager uses the newly set configuration to generate the client configuration files. However, until a restart is executed, the Kafka brokers still operate with the old configuration. As a result, Cloudera Manager runs the kafka-topics tool with an invalid configuration, causing the check and the rolling restart to fail. In this case, you must disable rolling restart checks until the Kafka brokers are restarted at least once. This can be done by setting Cluster Health Guarantee During Rolling Restart to none. Following the initial restart, the brokers operate with the new configuration and you can re-enable rolling restart checks.

Configuring rolling restart checks for brokers

You can configure Cloudera Manager to perform different types of checks on Kafka brokers during a rolling restart. The type of check performed by Cloudera Manager is configured with the Cluster Health Guarantee During Rolling Restart property. The property has multiple settings, each setting corresponds to a different type of check.

If your Kafka service is configured to use custom listeners, complete Configuring the client configuration used for rolling restart checks before continuing with this task.

  1. In Cloudera Manager, select the Kafka service.
  2. Go to Configuration.
  3. Find and configure the Cluster Health Guarantee During Rolling Restart property.
    Select one of the available options. Click the ? icon next to the property name to reveal a full description of each option and the check to which it corresponds.

    Cloudera recommends that you set this property to all partitions stay healthy to avoid service outages.

  4. Optional: Fine-tune rolling restart check behaviour by configuring the following properties:
    • Maximum Allowed Runtime For Kafka Broker Rolling Restart Check
    • Retry Interval For Kafka Broker Rolling Restart Check
    • Default API Timeout For Kafka Topics Client Used In Kafka Broker Rolling Restart Check
    These properties allow you to configure different interval and timeout values related to the rolling restart check. Configure these properties based on your cluster and requirements.
  5. Click Save Changes.
  6. Restart the Kafka service.
Rolling restart checks are configured and enabled. During any subsequent rolling restarts, Cloudera Manager executes the type of check you configured.
If you make any configuration changes related to broker connectivity (security, listeners, port, bootstrap) after rolling restart checks are enabled, you must disable rolling restart checks for the first restart after the change was made. Otherwise, the check and the rolling restart might fail. Following the initial restart, you can re-enable rolling restart checks.

Configuring rolling restart checks for controllers

You can configure Cloudera Manager to enable or disable checks on KRaft controllers during a rolling restart using the KRaft Cluster Health Guarantee During Rolling Restart property. Additionally, you can fine-tune the behavior of these checks using various timeout and threshold properties.

KRaft controller rolling restart checks are only applicable when KRaft is used as the metadata store for the Kafka service.

  1. In Cloudera Manager, select the Kafka service.
  2. Go to Configuration.
  3. Find and configure the KRaft Cluster Health Guarantee During Rolling Restart property.
    Select none to disable checks, or leave the property at its default, majority of controllers online, to maintain metadata quorum availability during rolling restarts.
  4. Optional: Fine-tune KRaft controller rolling restart check behaviour by configuring the following properties:
    • Maximum Allowed Runtime for Kafka Controller Rolling Restart Checks
    • Retry Interval for Kafka Controller Rolling Restart Checks
    • API Timeout For kafka-metadata-quorum Client Used In Kafka Controller Rolling Restart Checks
    • Maximum Allowed Lag for Active Kafka Controllers
    • Maximum Allowed Lag in LastFetchTimestamp for Active Kafka Controllers
    These properties allow you to configure different interval and timeout values related to the rolling restart check. Configure these properties based on your cluster and requirements.
  5. Click Save Changes.
  6. Restart the Kafka service.
KRaft controller rolling restart checks are configured. The configuration is used during subsequent rolling restarts.

Configuring the client configuration used for broker rolling restart checks

Cloudera Manager requires Kafka client configuration files to perform rolling restart checks on brokers. These files are generated automatically. However, if your Kafka service has custom listeners configured, you must manually update these client configuration files. Otherwise, the rolling restart check might fail.

When Cloudera Manager executes a rolling restart check, it uses the kafka-topics tool to gather information about the brokers, topics, and partitions. The kafka-topics tool requires a valid client configuration file to run. Cloudera Manager automatically generates two configuration files for this purpose. One is used for the kafka-topics commands initiated before the brokers are stopped, the other, after brokers are restarted.

If your Kafka service is configured to use custom listeners, you must manually update the configuration files generated by Cloudera Manager. This is required because Cloudera Manager might not be able to automatically extract the information required to establish a connection with the Kafka service when custom listeners are configured. The client configuration files can be updated using advanced security snippets.

  1. In Cloudera Manager, select the Kafka service.
  2. Go to Configuration.
  3. Manually update the client configuration files used during rolling restart checks.
    This can be done by adding a valid client configuration to the following advanced configuration snippets:
    • Kafka Broker Advanced Configuration Snippet (Safety Valve) for rolling_restart_check_before_stop_admin_client_configs.properties
    • Kafka Broker Advanced Configuration Snippet (Safety Valve) for rolling_restart_check_after_start_admin_client_configs.properties

    Ensure that you add the same client configuration to both snippets. The client configuration you add must contain all properties that are required to establish a connection with the brokers. The client configuration you add here is similar to any other client configuration you create for Kafka command line tools. However, unlike typical client configurations, this configuration also accepts the bootstrap.servers property. Use this property to specify the custom host:port pairs that you use as your custom listeners.

    The following client configuration example is for a Kafka service that has both TLS/SSL and Kerberos enabled. You can use this example as a template and make changes as needed. For more client configuration examples, see the Securing Apache Kafka publication in the Streams Messaging documentation.
    bootstrap.servers=[***HOST***]:[***PORT***]
    security.protocol=SASL_SSL
    ssl.client.auth=none
    sasl.mechanism=GSSAPI
    sasl.kerberos.service.name=kafka
    sasl.jaas.config=com.sun.security.auth.module.Krb5LoginModule required useKeyTab=true storeKey=true keyTab="[***PATH TO KEYTAB***]" principal="[***KERBEROS PRINCIPAL***]";
    ssl.keystore.location=[***PATH TO KEYSTORE.JKS***]
    ssl.key.password=[***PASSWORD***]
    ssl.keystore.password=[***PASSWORD***]
    ssl.keystore.type=jks
    ssl.truststore.location=[***PATH TO TRUSTSTORE.JKS***]
    ssl.truststore.type=jks
    ssl.truststore.password=[***PASSWORD***]
    
  4. Click Save Changes.
The client configuration files used by Cloudera Manager during rolling restart checks are configured.

Next, enable and configure rolling restart checks by completing Configuring rolling restart checks for brokers.