Troubleshooting ZooKeeper to KRaft migration
Learn about common issues that might occur during the migration of Kafka clusters from ZooKeeper to KRaft mode, and how to resolve them.
KRaft Controllers were started without migration configuration
Condition
When KRaft Controller roles are added to the cluster in Cloudera Private Cloud Base, they are stopped by default, but it is possible to start them without enabling migration mode. If this is done, the controllers start the KRaft cluster in full KRaft mode. This is a problem because the internal KRaft migration state is initialized in the terminal KRaft cluster state. Migration does not happen and cannot be initiated in this state.
Confirm this by reviewing the Cloudera Manager command history and the health history of the controller instances. If the KRaft Controllers are running but no migration was started, the controllers were started manually.
Further confirmation can be done by reviewing the
kafka_zk_migration_state metric. Use the following query to review the metric.
SELECT kafka_zk_migration_state
A metric value of 0 indicates that the controllers were started without migration mode enabled and the cluster is running in full KRaft mode.
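The metric values can be interpreted programmatically as well. The following sketch maps the values to descriptions based on Kafka's ZkMigrationState enum (NONE, PRE_MIGRATION, MIGRATION, POST_MIGRATION); the exact wording of each description is illustrative and the semantics may vary between Kafka versions.

```python
# Illustrative mapping of kafka_zk_migration_state values, based on Kafka's
# ZkMigrationState enum. Descriptions are a sketch, not official documentation.
MIGRATION_STATES = {
    0: "NONE - controllers started in full KRaft mode, migration cannot start",
    1: "PRE_MIGRATION - controllers started in migration mode, migration not yet begun",
    2: "MIGRATION - dual-write phase, metadata written to both ZooKeeper and KRaft",
    3: "POST_MIGRATION - migration finalized, ZooKeeper metadata no longer updated",
}

def describe_state(value: int) -> str:
    """Return a human-readable description for a metric sample."""
    return MIGRATION_STATES.get(value, f"unknown state {value}")
```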
Cause
KRaft Controller role instances were added to the cluster and started manually without migration configuration. That is, they were started manually and not through the Migrate Kafka to KRaft action.
Remedy
Newly provisioned KRaft controllers cannot maintain leader or quorum
Condition
If the KRaft controllers cannot maintain a stable leader election or quorum, the cluster's core metadata management is at risk. The active controller chart frequently updates with different controllers assuming leadership.
Cause
Networking issues (high likelihood) or disk issues (low likelihood).
Remedy
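Because networking is the most likely cause, a first diagnostic is to confirm that each controller host can reach its peers on the quorum port. A minimal sketch follows; the hostnames and port 9093 are placeholders for your own controller quorum addresses.

```python
import socket

def is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Placeholder controller quorum addresses; replace with your own.
controllers = [
    ("controller-1.example.com", 9093),
    ("controller-2.example.com", 9093),
    ("controller-3.example.com", 9093),
]

for host, port in controllers:
    status = "ok" if is_reachable(host, port) else "UNREACHABLE"
    print(f"{host}:{port} {status}")
```

Run the check from every controller host; asymmetric reachability (A reaches B, but B cannot reach A) is a common cause of flapping leadership.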
Broker connectivity issues
Condition
Brokers fail to correctly register with the KRaft controller quorum after their configuration is updated and started in migration mode.
Cause
Brokers show concerning (yellow) or bad (red) health in Cloudera Manager. Additionally, the logs show that brokers cannot register to the KRaft quorum.
Remedy
The remedy is highly dependent on the specific problem, and you will need to review logs to find the root cause. Before you review logs, ensure that brokers are connected to ZooKeeper and serve traffic normally.
Resume migration if the issue is resolved. Alternatively, revert the migration.
Performance degradation
Condition
Experiencing unexpected or significant performance drops, high latency, or increased resource utilization (CPU/memory) on the brokers or controllers. Relevant charts in Cloudera Manager or in any other metrics monitoring applications show the performance degradation of brokers, controllers, or clients.
Cause
This issue might surface during migration, but it might be unrelated to it. Consumer and Connect group rebalances might cause performance degradation on unbalanced clusters.
Remedy
Metadata consistency issues
Condition
Observing errors in metadata synchronization between the KRaft quorum and ZooKeeper during the dual-write phase of the migration to KRaft mode. For example:
- Brokers have inconsistent information about partition leaders, causing partitions to fall out of the ISR.
- Topic configurations might differ between ZooKeeper and KRaft.
- Broker registrations might differ between ZooKeeper and KRaft metadata.
Cause
As a result of network issues, bugs, unsupported configurations, or misconfiguration, metadata consistency issues might happen when the brokers are being restarted in KRaft mode. During this phase, a part of the cluster already reads metadata from KRaft, while the other parts still use ZooKeeper. Parts of the cluster that still use ZooKeeper might learn about updates much later if there is a network glitch or a bug.
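To pinpoint where the two views diverge, it can help to diff the topic configurations fetched from each side. How you obtain the configurations depends on your tooling; the dictionaries below are illustrative inputs and the helper is a sketch.

```python
def config_drift(zk_configs: dict, kraft_configs: dict) -> dict:
    """Return {topic: (zk_value, kraft_value)} for every topic whose
    configuration differs between the two views. A topic present on only
    one side is reported with None for the missing side."""
    drift = {}
    for topic in set(zk_configs) | set(kraft_configs):
        zk_val = zk_configs.get(topic)
        kraft_val = kraft_configs.get(topic)
        if zk_val != kraft_val:
            drift[topic] = (zk_val, kraft_val)
    return drift
```

An empty result means the two views agree for the topics you compared; any entries point you at the topics to investigate first.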
Remedy
- Do network diagnostics to ensure that buffers are set up for the latency of the network (especially in stretch clusters).
- Check DNS response times. Specifically check if Kafka spends a lot of time resolving DNS addresses. Sometimes DNS resolution problems can cause unstable messaging in Kafka.
- Check for authentication or encryption issues. A slow or inconsistent Key Distribution Center (KDC) might cause connections or authentication requests to lag. This in turn slows down in-sync replica (ISR) traffic and UpdateMetadata requests, which can affect metadata propagation.
- Consider reverting the migration, fixing any network issues, and then restarting the migration.
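For the DNS check above, a minimal sketch that times a single name resolution is shown below; run it against your broker and controller hostnames (the hostname here is a placeholder). Consistently high or erratic timings suggest DNS is contributing to the instability.

```python
import socket
import time

def resolution_time_ms(hostname: str) -> float:
    """Time one getaddrinfo() lookup in milliseconds; raises OSError on failure."""
    start = time.perf_counter()
    socket.getaddrinfo(hostname, None)
    return (time.perf_counter() - start) * 1000.0

# Placeholder hostname; substitute your broker or controller addresses.
ms = resolution_time_ms("localhost")
print(f"localhost resolved in {ms:.1f} ms")
```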
Client compatibility
Condition
A critical application or tooling component (especially an older one) relies directly on ZooKeeper paths or functionality that breaks during the dual-write phase. Such errors are unlikely because the last Java client that directly accessed ZooKeeper was removed in Kafka 0.10, but third-party clients not supported by Cloudera might still rely on ZooKeeper connections. Ideally, this problem should only surface in development or test environments.
Cause
Client applications cannot communicate with the brokers. The crash happens when the brokers enter their dual-write phase.
Remedy
Migration fails with NoAuthException
Condition
ZooKeeper Access Control List (ACL) synchronization is done by the migration command as a preparatory step. Any subsequent manual changes to the ACLs can potentially break the migration. Such issues can happen if you use Kafka CLI tools to modify topic configuration or change the topics. Essentially, changing the ZooKeeper structure with any CLI commands can break migration.
Exception in thread "main" org.apache.zookeeper.KeeperException$NoAuthException: KeeperErrorCode = NoAuth for /kafka-specific-node
Cause
ZooKeeper ACLs were changed during migration. Kafka command line tools were used to update topics or their configurations.
Remedy
- Apply the same ACLs to the node printed in the exception as are applied to the other nodes.
- Resume migration.
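To identify which znode needs its ACLs fixed, the path can be extracted from the exception message before applying the change. The parsing below is a sketch written against the message format shown above.

```python
import re
from typing import Optional

def znode_from_noauth(message: str) -> Optional[str]:
    """Extract the offending znode path from a NoAuth KeeperException message."""
    match = re.search(r"KeeperErrorCode = NoAuth for (/\S+)", message)
    return match.group(1) if match else None
```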
The kraft user is not authorized to perform actions
Condition
Authorization errors are included in the Kafka and KRaft logs that say that the
kraft user is not authorized to perform actions.
Cause
Ranger policies are not updated for KRaft, or are misconfigured.
