Cruise Control Rebase Summary

In CDP Private Cloud Base 7.1.8, Cruise Control is rebased from 2.0.100 to the 2.5.66 version. Other than the added new feature, several issues are fixed and several features are enhanced to have a better perfomance when using Cruise Control.

Table 1. Fixed Issues
PR-1184 Fix the bug in replica movement strategy chaining PR-1291 Fix GOAL_VIOLATION detector getting stuck execution in GENERATING_PROPOSALS_FOR_EXECUTION state
PR-1209Fix NPE if task execution takes longer than executionProgressCheckIntervalMs() PR-1381 Fix a bug that might cause invalid throttle replica list to be used
PR-1231 Fix reported balancedness when there are offline brokers PR-1476 Fix miscalculated recommendation for the number of racks to drop for clusters over-provisioned wrt rack count
PR-1232 Fix returns for completed_with_error tasks on user_tasks endpoint when json=false PR-1597 Fix incorrect values generated for number of replicas by topic
PR-1238 Fix inconsistent/bad response in monitor substate PR-1616 Fix throttler quota removal for in-progress tasks
PR-1279 Fix missed broker failure detection/self-healing upon bootstrap PR-1676 Fix EnvConfigProvider to work well if there is no pre configured env vars
Table 2. Version Update
PR-1233 Upgrade to Kafka 2.5.0 in development branch migrate_to_kafka_2_5
PR-1311 Add support for Kafka 2.6 brokers
PR-1471 Add support for Kafka version 2.7
PR-1612 Upgrade to Kafka 2.8 libraries
PR-1614 Updated to Scala 2.13
Table 3. Feature Support
PR-1159 Support SPNEGO and trusted proxy authentication PR-1569 Add support to switch from ZK to Kafka Admin Client for topic config provider class
PR-1245, PR-1320 Bump up ZK session and connection timeout PR-1583 Ability to configure TLS protocols and ciphers via configuration
PR-1510 Add support for Alerta.io notifications PR-1661 Support connecting to ZooKeeper with TLS
PR-1525 Add support for (At/Under)MinISR-based throttling/cancellation PR-1703 Add ZK TLS support with properties file and modify broker failure detection
Table 4. Goal Improvements
PR-1203 Add missing sanity check for goals PR-1400 Add gap-based balance limits for TopicReplicaDistributionGoal
PR-1267 LeaderReplicaDistributionGoal should honor excludedTopics during leadership movement PR-1420 Relax low resource utlization upper limit for resource distribution goals
PR-1306 Update min valid windows required to start self-healing with goals using resource history PR-1429 Add a new hard goal that ensures that each alive broker has a leader replica from a configured pool of topics
PR-1324 Add support for goal-based operations via maintenance event PR-1500 Drop enforcement that anomaly.detection.goals must be a subset of self.healing.goals (if non-empty)
PR-1345 Add a new hard goal that evenly distributes replicas over racks PR-1514 Prevent ResourceDistributionGoal from generating provision recommendations with a negative number of brokers to remove
PR-1383 Ensure that topology distribution goals compute balance constraints properly PR-1564 Add timers to track goal violation detection and fix generation for self-healing
PR-1385 Add support to switch from ZK to Kafka Admin Client for topic config provider class
Table 5. Functonality Enhancements
PR-868 Provide capacity stats for a broker PR-1302 Support stop ongoing executions with rollback PR-1448 Reset the provision status to ensure freshness for consecutive optimizations
PR-1177 Make TopicReplicationFactorAnomalyFinder ignore topic with large minISR PR-1313 Make min execution progress check interval and slow task alerting backoff configurable PR-1456 Update slow broker detection sensitivity and reporting details
PR-1180 Update min valid windows required to start self-healing with goals using resource history PR-1316 Added option to configure metrics topic minISR PR-1460 Add a config for admin client request timeout
PR-1186 Further filter detected slow broker against pre-defined flush time threshold PR-1332 Support handling planned maintenance events submitted via a topic PR-1463 Add sensors to report the number of slow brokers
PR-1190 [ccclient-1.1.0] Add force_stop parameter PR-1334 [ccclient-1.1.1] Extend support for anomaly detectors PR-1469 Report ongoing replica reassignments started by an external agent
PR-1196 Adopt admin client-based replica reassignment API for Kafka 2.4+ PR-1341 Adopt a shared AdminClient across selected CC components PR-1470 Provide recommendations on the estimated resource requirements
PR-1198 Support environment variable resolution in configs PR-1349 Make MetricSampler more extensible PR-1484 Add a sensor to indicate the metadata factor of the managed Kafka cluster
PR-1199 Add rack information to load response PR-1357 Add check to validate that time range start time is smaller than end time PR-1485 Add a sensor to indicate if the cluster has partitions with RF > the number of eligible racks
PR-1212 Make CPU capacity threshold stricter PR-1358 Check whether a cluster is using JBOD when populate_disk_info is true PR-1496 Stop execution before shutting down Cruise Control
PR-1214 Update the Balancedness Score under Unhealthy Cluster State PR-1360 Handle Wrapped AdminClient Timeouts as Timeouts PR-1505 Make CC inter-broker replica reassignments resilient against partitions with ISR set > replica set
PR-1220 Enable configurable backoff on metrics topic creation PR-1362 Prevent MaintenanceEventTopicReader from prematurely closing the adminClient PR-1509 Add a sensor to identify if the cluster has partitions with ISR > replicas
PR-1223 Ensure that requests to update replication factor cannot cause a deadlock PR-1364 Let implementations of OptimizationOptionsGenerator be configured with an AdminClient PR-1533 Avoid NPE due to misused rebalance_disk parameter
PR-1225 Allow customization of CruiseControlMetricsReporterSampler PR-1368 Automate leadership concurrency adjustment based on broker metrics PR-1538 Enable partition metric collection to a configured topic during ongoing execution
PR-1234 Ensure consistent rackID during topic_configuration operations to avoid NPE PR-1369 Enable Cruise Control to collect metrics from low traffic clusters by default PR-1559 Add a ReplicaMovementStrategy that prioritizes (At/Under)MinISR partitions with offline replicas
PR-1241 Add support to retrieve capacity only via load endpoint PR-1391 Calculate balance lower bound for resource distribution lower bound with low utilization threshold PR-1589 Add parameters to relevant endpoints to control the speed of proposal generation
PR-1246 Enable Kafka port retrieval from listeners config PR-1401 Honor webserver.api.urlprefix config PR-1593 Set the default proposal generation speed to fast mode
PR-1255 Enable broker metric collection during ongoing executions PR-1407 Provide idempotency support to handle duplicate maintenance events PR-1599 Add a config to enable/disable provisioner
PR-1256 Prevent Cruise Control from mistakenly believe that there is an ongoing execution PR-1409 [ccclient-1.1.2] Add topic parameter in cruise-control-client to query the kafka_cluster_state endpoint PR-1604 Move broker failure detector away from using zNode-based failed broker persistence
PR-1282 Let executor substate show information on process before starting an execution PR-1410 Add a sensor to emit topic count in cluster PR-1608 Disable bootstrap endpoint in non-developer_mode
PR-1287 Prevent concurrent execution request from corrupting ongoing execution state PR-1412 Enable detecting and fixing maintenance events during ongoing executions PR-1622 Enable users to monitor the min.insync.replicas of all topics
PR-1289 Automate replica reassignment concurrency adjustment based on broker metrics PR-1418 Add missing population of anomaly details for maintenance events PR-1635 Add config to skip rack-awareness check while self-healing RF anomalies
PR-1294 Add sensors to emit concurrency caps for partition and leadership reassignments PR-1419 Handle non-existent topic while setting/removing throttled replicas for a topic PR-1636 Add capability to stop executions not started by Cruise Control
PR-1297 Make timeout for listing partition reassignments configurable along with a retry logic PR-1433 Handle missing listeners config in CruiseControlMetricsReporter PR-1637 Detect RF violations for topics having targetReplicationFactor with topicReplicationFactorMargin violation
PR-1646 Setup ProvisionerState for rightsizing clusters PR-1681 Handle metrics reporter exceptions while getting CPU metric