Enabling self-healing for all or individual anomaly types
Self-healing is disabled for Cruise Control by default. You can enable self-healing in Cloudera Manager using the cruisecontrol.properties configuration, or with a curl POST request and the corresponding anomaly type.
Enabling self-healing in Cloudera Manager
- Go to your cluster in Cloudera Manager.
- Select Cruise Control from the list of Services.
- Click on Configuration tab.
- Search for the Cruise Control Server Advanced Configuration Snippet (Safety Valve) for cruisecontrol.properties setting.
- Choose to enable self-healing for all or only specific anomaly types, and add the
corresponding parameter to the Safety Valve field based on your requirements.
- To enable self-healing for all anomaly types, add
self.healing.enabled=true
configuration parameter to the Safety Valve. - To enable self-healing for specific anomaly types, add the corresponding configuration
parameter to the Safety Valve:
self.healing.broker.failure.enabled=true
self.healing.goal.violation.enabled=true
self.healing.disk.failure.enabled=true
self.healing.topic.anomaly.enabled=true
self.healing.slow.broker.removal.enabled=true
self.healing.metric.anomaly.enabled=true
self.healing.maintenance.event.enabled=true
- To enable self-healing for all anomaly types, add
- Provide additional configuration to self-healing.There are additional configurations that you can use to further customize the self-healing process.
For more information about the Self- healing configurations, see the Cruise Control documentation.Configuration Value Description anomaly.notifier.class
com.linkedin.kafka.cruisecontrol.detector.notifier.SelfHealingNotifier The notifier class to trigger an alert when an anomaly is violated. The notifier class must be configured to enable self-healing. For more information, see Enabling self-healing in Cruise Control. broker.failure.alert.threshold.ms
900,000 Defines the threshold to mark a broker as dead. If a non-empty broker leaves the cluster at time T and did not join the cluster before T + broker.failure.alert.threshold.ms
, the broker is defined as dead broker sinceT
. An alert will be triggered in this case.broker.failure.self.healing.threshold.ms
1,800,000 If self-healing is enabled and a broker is dead at T
, self-healing will be triggered atT
+broker.failure.self.healing.threshold.ms
. - Provide additional configuration to the anomaly types.There are additional configurations that you can provide for the anomaly types.
For more information about the Anomaly detector configurations, see the Cruise Control documentation.Anomaly type Configuration Value Description Broker failure broker.failures.class
com.linkedin.kafka.cruisecontrol.detector.BrokerFailuresfailed.brokers.file.path The name of the class that extends broker failures. failed.brokers.file.path
fileStore/failedBrokers.txt The file path to store the failed broker list. This is to persist the broker failure time in case Cruise Control failed and restarted when some brokers are down. fixable.failed.broker.count.threshold
10 The upper boundary of concurrently failed broker counts that are taken as fixable. If too many brokers are failing at the same time, it is often due to something more fundamental going wrong and removing replicas from failed brokers cannot alleviate the situation. fixable.failed.broker.percentage.threshold
0.4 The upper boundary of concurrently failed broker percentage that are taken as fixable. If a large portion of brokers are failing at the same time, it is often due to something more fundamental going wrong and removing replicas from failed brokers cannot alleviate the situation. broker.failure.detection.backoff.ms
300000 The backoff time in millisecond before broker failure detector triggers another broker failure detection if currently detected broker failure is not ready to fix. kafka.broker.failure.detection.enable
false Whether to use the Kafka API to detect broker failures instead of ZooKeeper. When enabled, zookeeper.connect
does not need to be set.broker.failure.detection.interval.ms
null The interval in millisecond that broker failure detector will run to detect broker failures. If this interval time is not specified, the broker failure detector will run with interval specified in anomaly.detection.interval.ms
. This is only used whenkafka.broker.failure.detection.enable
is set to 'true'.Goal violation goal.violations.class
com.linkedin.kafka.cruisecontrol.detector.GoalViolations The name of the class that extends goal violations. anomaly.detection.goals
For the list of available goals, see the Configuring goals section. The goals that the anomaly detector should detect if they are violated. goal.violation.detection.interval.ms
value of anomaly.detection.interval.ms
The interval in millisecond that goal violation detector will run to detect goal violations. If this interval time is not specified, goal violation detector will run with interval specified in anomaly.detection.interval.ms
.Disk failure disk.failures.class
com.linkedin.kafka.cruisecontrol.detector.DiskFailures The name of the class that extends disk failures anomaly. disk.failure.detection.interval.ms
value of anomaly.detection.interval.ms
The interval in millisecond that disk failure detector will run to detect disk failures. If this interval time is not specified, disk failure detector will run with interval specified in anomaly.detection.interval.ms
.Topic anomaly topic.anomaly.detection.interval.ms
value of anomaly.detection.interval.ms
The interval in millisecond that topic anomaly detector will run to detect topic anomalies. If this interval time is not specified, topic anomaly detector will run with interval specified in anomaly.detection.interval.ms
.topic.anomaly.finder.class
com.linkedin.kafka.cruisecontrol.detector.NoopTopicAnomalyFinder A list of topic anomaly finder classes to find the current state to identify topic anomalies. Slow broker slow.broker.bytes.in.rate.detection.threshold
1024.0 The bytes in rate threshold in units of kilobytes per second to determine whether to include brokers in slow broker detection. slow.broker.log.flush.time.threshold.ms
1000.0 The log flush time threshold in units of millisecond to determine whether to detect a broker as a slow broker. slow.broker.metric.history.percentile.threshold
90.0 The percentile threshold used to compare the latest metric value against historical value in slow broker detection. slow.broker.metric.history.margin
3.0 The margin used to compare the latest metric value against historical value in slow broker detection. slow.broker.peer.metric.percentile.threshold
50.0 The percentile threshold used to compare last metric value against peers' latest value in slow broker detection. slow.broker.peer.metric.margin
10.0 The margin used to compare last metric value against peers' latest value in slow broker detection. slow.broker.demotion.score
5 The score threshold to trigger a demotion for slow brokers. slow.broker.decommission.score
50 The score threshold to trigger a removal for slow brokers. slow.broker.self.healing.unfixable.ratio
0.1 The maximum ratio of slow brokers in the cluster to trigger self-healing operation. Metric anomaly metric.anomaly.class
com.linkedin.kafka.cruisecontrol.detector.KafkaMetricAnomaly The name of class that extends metric anomaly. metric.anomaly.detection.interval.ms
value of anomaly.detection.interval.ms
The interval in millisecond that metric anomaly detector will run to detect metric anomalies. If this interval time is not specified, the metric anomaly detector will run with the interval specified in anomaly.detection.interval.ms
.Maintenance event maintenance.event.reader.class
com.linkedin.kafka.cruisecontrol.detector.NoopMaintenanceEventReader A maintenance event reader class to retrieve maintenance events from the user-defined store. maintenance.event.class
com.linkedin.kafka.cruisecontrol.detector.MaintenanceEvent The name of the class that extends the maintenance event. maintenance.event.enable.idempotence
true The flag to indicate whether maintenance event detector will drop the duplicate maintenance events detected within the configured retention period. maintenance.event.idempotence.retention.ms
180000 The maximum time in ms to store events retrieved from the MaintenanceEventReader. Relevant only if idempotency is enabled (see maintenance.event.enable.idempotence
).maintenance.event.max.idempotence.cache.size
25 The maximum number of maintenance events cached by the MaintenanceEventDetector within the past maintenance.event.idempotence.retention.ms ms. Relevant only if idempotency is enabled (see maintenance.event.enable.idempotence
).maintenance.event.stop.ongoing.execution
true The flag to indicate whether a maintenance event will gracefully stop the ongoing execution (if any) and wait until the execution stops before starting a fix for the anomaly. - Click Save changes.
- Click on next to the Cruise Control service name to restart Cruise Control.
Enabling self-healing using REST API
- Open a command line tool.
- Use ssh and connect to your cluster running Cruise
Control.
ssh root@<your_hostname>
You will be prompted to provide your password.
- Enable self-healing for the required anomaly types using the following POST
command:
POST /kafkacruisecontrol/admin?enable_self_healing_for=[anomaly_type]
The following parameters must be used for anomaly_type:GOAL_VIOLATION
BROKER_FAILURE
METRIC_ANOMALY
DISK_FAILURE
TOPIC_ANOMALY
- Check which anomalies are currently in use, and which are detected with the following GET
command:
GET /kafkacruisecontrol/state
When reviewing the state of Cruise Control, you can check the status of Anomaly Detector at
the following parameters:
selfHealingEnabled
- Anomaly type for which self-healing is enabledselfHealingDisabled
- Anomaly type for which self healing is disabledrecentGoalViolations
- Recently detected goal violationsrecentBrokerFailures
- Recently detected broker failuresrecentDiskFailures
- Recently detected disk failuresrecentMetricAnomalies
- Recently detected metric anomalies