Rolling restart checks
You can configure Cloudera Manager to perform checks during a rolling restart to ensure that Kafka roles stay healthy. Rolling restart checks are available for both Kafka brokers and KRaft controllers, with each configured independently to provide different levels of health guarantees.
Rolling restart checks are enabled by default to ensure cluster health during rolling restarts. For Kafka brokers, the default check ensures that healthy partitions stay healthy during restarts. For KRaft controllers, the default check ensures that the majority of controllers remain online during restarts. These checks help prevent service disruptions that could occur if roles are restarted without verifying cluster health. For example, without checks, Cloudera Manager might restart a broker while the previous broker is not fully ready for operation, causing outages and corrupted log indexes. Similarly, for KRaft controllers, restarting without proper checks can disrupt the metadata quorum and impact cluster operations.
Broker and controller rolling restart checks are configured independently using separate properties. This allows you to disable, enable, or configure different check levels for each type based on your cluster requirements.
- Kafka broker checks are configured using the Cluster Health Guarantee During Rolling Restart property. These checks focus on partition health and replica synchronization to ensure that topics remain accessible and data is not lost during broker restarts.
- KRaft controller checks are configured using the KRaft Cluster Health Guarantee During Rolling Restart property. These checks focus on maintaining metadata quorum health to ensure that the majority of controllers remain available during controller restarts. These checks are only applicable when KRaft is used as the metadata store.
Rolling restart checks for Kafka brokers
Kafka broker rolling restart checks focus on partition health and replica synchronization. There are multiple checks available, each providing a different level of guarantee on Kafka cluster and broker health. The type of check performed is configured with the Cluster Health Guarantee During Rolling Restart property.
- none
- This setting disables rolling restart checks. If this option is selected, no checks are performed and no health guarantees are provided.
- ready for request
- This setting ensures that when a broker is restarted, the restarted broker is accepting and responding to requests made on its service port. The next broker is only restarted after the previous broker is ready for requests.
- healthy partitions stay healthy (default)
- This setting ensures that no partitions go into an under-min-isr state when a broker is stopped. This is achieved by waiting before each broker is stopped so that all other brokers can catch up with all replicas that are in an at-min-isr state. Additionally, this setting ensures that the restarted broker is accepting and responding to requests made on its service port before the next broker is restarted. This setting ignores partitions that are already in an under-min-isr state.
- all partitions stay healthy (recommended)
- This setting ensures that no partitions are in an under-min-isr or at-min-isr state when a broker is stopped. This is achieved by waiting before each broker is stopped so that all other brokers can catch up with all replicas that are in an at-min-isr or under-min-isr state. Additionally, this setting ensures that the restarted broker is accepting requests on its service port before the next broker is restarted.
- all partitions fully replicated
- This setting ensures that all partitions are in a fully synchronized state when a broker is stopped. This is achieved by waiting before each broker is stopped so that all other brokers can catch up with all replicas that are out of sync. Additionally, this setting ensures that the restarted broker is accepting requests on its service port before the next broker is restarted.
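For reference, the partition states that these settings evaluate can also be inspected manually with the kafka-topics tool. The following commands are an illustrative sketch only; the bootstrap server address and the client configuration path are placeholders that must be replaced with values valid for your cluster.
kafka-topics --bootstrap-server mybroker.example.com:9092 --command-config /path/to/client.properties --describe --under-min-isr-partitions
kafka-topics --bootstrap-server mybroker.example.com:9092 --command-config /path/to/client.properties --describe --at-min-isr-partitions
kafka-topics --bootstrap-server mybroker.example.com:9092 --command-config /path/to/client.properties --describe --under-replicated-partitions
The first command lists partitions that are in an under-min-isr state, the second lists partitions that are in an at-min-isr state, and the third lists all partitions that are not fully replicated. These are the partition states that the healthy partitions stay healthy, all partitions stay healthy, and all partitions fully replicated settings evaluate.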
When Cloudera Manager executes a broker rolling restart check, it uses
the kafka-topics tool to gather information about the brokers, topics, and
partitions. The kafka-topics tool requires a valid client configuration
file to run. In the case of rolling restart checks, two configuration files are required: one for the kafka-topics commands initiated before a broker is stopped, and another for the commands initiated after a broker is restarted. Cloudera Manager automatically generates these client configuration files
based on the configuration of the Kafka service. These files can also be manually updated
using advanced security snippets.
Using these files, Cloudera Manager executes
kafka-topics commands on the brokers. Based on the response from the
tool, Cloudera Manager either waits for a specified amount of time or
continues with the rolling restart.
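The contents of these client configuration files depend entirely on how the Kafka service is secured. As a minimal sketch, assuming a TLS and Kerberos enabled cluster, such a file could contain standard Kafka client properties similar to the following; the values and paths are placeholders.
# Illustrative client configuration used by the kafka-topics commands.
# Cloudera Manager generates the actual values from the Kafka service configuration.
security.protocol=SASL_SSL
sasl.mechanism=GSSAPI
sasl.kerberos.service.name=kafka
ssl.truststore.location=/path/to/truststore.jks
ssl.truststore.password=changeit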
Depending on what type of check is configured, Cloudera Manager polls
information with kafka-topics at different points in time. As a result, the
checks can be categorized into two groups: pre-checks and post-checks. If either
healthy partitions stay healthy or all partitions stay
healthy is selected, information is polled both before a broker is stopped
(pre-check) and after a broker is restarted (post-check). If the ready for
request setting is selected, information is only polled after a broker is
restarted.
If a pre-check does not find a state in which the broker can be safely stopped, the check stops the entire rolling restart process. This can happen if the broker that is about to be stopped still has at-min-isr or under-min-isr partitions when the configured timeout interval is reached. Post-checks behave in a similar way: if the post-check does not receive confirmation (a successful exit code) from the kafka-topics command within the specified timeout interval that the restarted broker is ready for requests, the check stops the entire rolling restart process. In both cases, no further brokers are stopped or restarted; the rolling restart fails and the brokers continue to run.
In addition to configuring and enabling these checks using Cluster Health Guarantee During Rolling Restart, a number of other configuration properties are also available that enable you to fine-tune the behavior of the checks. For detailed steps on how to enable and configure rolling restart checks, see Configuring rolling restart checks for brokers.
Rolling restart checks for KRaft controllers
KRaft controller rolling restart checks are configured independently from Kafka broker rolling restart checks using the KRaft Cluster Health Guarantee During Rolling Restart property. KRaft controller rolling restart checks only apply when KRaft is used as the metadata store for the Kafka service.
The KRaft controller rolling restart checks ensure that the KRaft quorum remains healthy during a rolling restart. The checks verify that the majority of controllers (half + 1) remain available and operational throughout the restart process. Unlike broker checks, which focus on partition health, controller checks focus on maintaining metadata quorum health.
- none
- This setting disables KRaft controller rolling restart checks. If this option is selected, no checks are performed on controllers during a rolling restart.
- majority of controllers online (default)
- This setting ensures that during a rolling restart, the majority of KRaft controllers remain online and operational. This is the recommended setting to maintain metadata availability during restarts.
When Cloudera Manager executes a controller rolling restart check, it uses the kafka-metadata-quorum command line tool to verify quorum health. The checks are performed in two phases:
- Pre-check (before stopping a controller): Before stopping a controller, Cloudera Manager verifies that the majority of the KRaft quorum will remain available after the controller is stopped. The check counts the number of active controllers (controllers with low lag and recent metadata fetch timestamps) and ensures that at least half + 1 controllers will remain active after stopping the current controller. If the majority requirement cannot be met, the check waits and retries until the requirement is satisfied or a timeout is reached.
- Post-check (after starting a controller): After restarting a controller, Cloudera Manager verifies that the controller is actively participating in metadata updates. The check monitors the controller's LastFetchTimestamp and ensures that it is increasing, which indicates that the controller is successfully fetching and processing metadata updates from the quorum leader. The check succeeds when the timestamp increases.
A controller is considered active when all of the following conditions are met:
- Its lag (number of uncommitted metadata messages) is below the configured threshold (default: 5 messages)
- Its LastFetchTimestamp is recent (within the configured threshold, default: 5 seconds)
- Its status is either Leader or Follower (not Observer)
You can fine-tune the behavior of these checks with the following configuration properties:
- Maximum Allowed Runtime for Kafka Controller Rolling Restart Checks - Specifies the overall timeout (default: 15 minutes)
- Retry Interval for Kafka Controller Rolling Restart Checks - Specifies the wait time between check attempts (default: 30 seconds)
- API Timeout For kafka-metadata-quorum Client Used In Kafka Controller Rolling Restart Checks - Specifies the timeout for the kafka-metadata-quorum command (default: 60 seconds)
- Maximum Allowed Lag for Active Kafka Controllers - Specifies the maximum lag in messages before a controller is considered inactive (default: 5 messages)
- Maximum Allowed Lag in LastFetchTimestamp for Active Kafka Controllers - Specifies the maximum timestamp lag before a controller is considered inactive (default: 5 seconds)
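You can inspect the same quorum information that these checks rely on by running the kafka-metadata-quorum tool manually. The command and output below are an illustrative sketch only; the bootstrap server, client configuration path, node IDs, and values are placeholders, and the exact output format can vary between Kafka versions.
kafka-metadata-quorum --bootstrap-server mybroker.example.com:9092 --command-config /path/to/client.properties describe --replication
NodeId  LogEndOffset  Lag  LastFetchTimestamp  LastCaughtUpTimestamp  Status
9991    2344          0    1700000000000       1700000000000          Leader
9992    2344          0    1699999999800       1699999999800          Follower
9993    2337          7    1699999992000       1699999992000          Follower
In this sketch, controller 9993 reports a lag of 7 messages and a LastFetchTimestamp roughly 8 seconds behind the leader. With the default thresholds listed above, the checks would treat it as inactive, so a pre-check would wait before allowing another controller to be stopped.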
Advanced configuration for broker checks
- Kafka brokers are configured to use custom listeners
- If you configured your Kafka brokers with advanced configuration snippets to use custom listeners (for example a custom host:port pair), you must manually update both client configuration files that Cloudera Manager generates. This is required because Cloudera Manager might not be able to automatically extract the information required to establish a connection with the Kafka brokers when custom listeners are configured. For more information, see Configuring the client configuration used for rolling restart checks.
- A broker connectivity change is made after rolling restart checks are enabled
- A broker connectivity change is any type of change made to listeners, bootstrap servers, ports, or security. If a change like this is made after rolling restart checks are enabled, Cloudera Manager uses the newly set configuration to generate the client configuration files. However, until a restart is executed, the Kafka brokers still operate with the old configuration. As a result, Cloudera Manager runs the kafka-topics tool with an invalid configuration, causing the check and the rolling restart to fail. In a case like this, you must disable rolling restart checks until the Kafka brokers are restarted at least once. This can be done by setting Cluster Health Guarantee During Rolling Restart to none. Following the initial restart, the brokers operate with the new configuration and you can re-enable rolling restart checks.
Configuring rolling restart checks for brokers
You can configure Cloudera Manager to perform different types of checks on Kafka brokers during a rolling restart. The type of check performed by Cloudera Manager is configured with the Cluster Health Guarantee During Rolling Restart property. The property has multiple settings; each setting corresponds to a different type of check.
If your Kafka service is configured to use custom listeners, complete Configuring the client configuration used for rolling restart checks before continuing with this task.
Configuring rolling restart checks for controllers
You can configure Cloudera Manager to enable or disable checks on KRaft controllers during a rolling restart using the KRaft Cluster Health Guarantee During Rolling Restart property. Additionally, you can fine-tune the behavior of these checks using various timeout and threshold properties.
KRaft controller rolling restart checks are only applicable when KRaft is used as the metadata store for the Kafka service.
Configuring the client configuration used for broker rolling restart checks
Cloudera Manager requires Kafka client configuration files to perform rolling restart checks on brokers. These files are generated automatically. However, if your Kafka service has custom listeners configured, you must manually update these client configuration files. Otherwise, the rolling restart check might fail.
When Cloudera Manager executes a rolling restart check, it uses the
kafka-topics tool to gather information about the brokers, topics, and
partitions. The kafka-topics tool requires a valid client configuration
file to run. Cloudera Manager automatically generates two configuration files for this
purpose. One is used for the kafka-topics commands initiated before the brokers are stopped; the other, for the commands initiated after the brokers are restarted.
If your Kafka service is configured to use custom listeners, you must manually update the configuration files generated by Cloudera Manager. This is required because Cloudera Manager might not be able to automatically extract the information required to establish a connection with the Kafka service when custom listeners are configured. The client configuration files can be updated using advanced security snippets.
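As an illustrative sketch of such a manual update, the advanced security snippets could be used to add or override connection and security properties in both generated files. The property names below are standard Kafka client properties, but the values are placeholders and must match how your custom listeners are actually configured.
# Example override added through the advanced security snippets for the
# rolling restart check client configuration files (values are placeholders).
security.protocol=SSL
ssl.truststore.location=/path/to/custom/truststore.jks
ssl.truststore.password=changeit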
Enable and configure rolling restart checks by completing Configuring rolling restart checks for brokers.
