Enabling self-healing for all or individual anomaly types

Self-healing is disabled for Cruise Control by default. You can enable self-healing in Cloudera Manager using the cruisecontrol.properties configuration, or with a curl POST request and the corresponding anomaly type.

Enabling self-healing in Cloudera Manager🔗

Go to your cluster in Cloudera Manager.
Select Cruise Control from the list of Services.
Click on Configuration tab.
Search for the Cruise Control Server Advanced Configuration Snippet (Safety Valve) for cruisecontrol.properties setting.
Choose to enable self-healing for all or only specific anomaly types, and add the corresponding parameter to the Safety Valve field based on your requirements.
1. To enable self-healing for all anomaly types, add self.healing.enabled=true configuration parameter to the Safety Valve.
2. To enable self-healing for specific anomaly types, add the corresponding configuration parameter to the Safety Valve:
  - self.healing.broker.failure.enabled=true
  - self.healing.goal.violation.enabled=true
  - self.healing.disk.failure.enabled=true
  - self.healing.topic.anomaly.enabled=true
  - self.healing.slow.broker.removal.enabled=true
  - self.healing.metric.anomaly.enabled=true
  - self.healing.maintenance.event.enabled=true

Provide additional configuration to self-healing.

There are additional configurations that you can use to further customize the self-healing process.


Configuration	Value	Description
`anomaly.notifier.class`	`com.linkedin.kafka.cruisecontrol.detector.notifier.SelfHealingNotifier`	The notifier class to trigger an alert when an anomaly is violated. The notifier class must be configured to enable self-healing. For more information, see Enabling self-healing in Cruise Control.
`broker.failure.alert.threshold.ms`	`900,000`	Defines the threshold to mark a broker as dead. If a non-empty broker leaves the cluster at time `T` and did not join the cluster before `T` + `broker.failure.alert.threshold.ms`, the broker is defined as dead broker since `T`. An alert will be triggered in this case.
`broker.failure.self.healing.threshold.ms`	`1,800,000`	If self-healing is enabled and a broker is dead at `T`, self-healing will be triggered at `T` + `broker.failure.self.healing.threshold.ms`.

For more information about the Self- healing configurations, see the Cruise Control documentation.

Provide additional configuration to the anomaly types.

There are additional configurations that you can provide for the anomaly types.


Anomaly type	Configuration	Value	Description
Broker failure	`broker.failures.class`	`com.linkedin.kafka.cruisecontrol.detector.BrokerFailuresfailed.brokers.file.path`	The name of the class that extends broker failures.
	`failed.brokers.file.path`	`fileStore/failedBrokers.txt`	The file path to store the failed broker list. This is to persist the broker failure time in case Cruise Control failed and restarted when some brokers are down.
	`fixable.failed.broker.count.threshold`	`10`	The upper boundary of concurrently failed broker counts that are taken as fixable. If too many brokers are failing at the same time, it is often due to something more fundamental going wrong and removing replicas from failed brokers cannot alleviate the situation.
	`fixable.failed.broker.percentage.threshold`	`0.4`	The upper boundary of concurrently failed broker percentage that are taken as fixable. If a large portion of brokers are failing at the same time, it is often due to something more fundamental going wrong and removing replicas from failed brokers cannot alleviate the situation.
	`broker.failure.detection.backoff.ms`	`300000`	The backoff time in millisecond before broker failure detector triggers another broker failure detection if currently detected broker failure is not ready to fix.
	`kafka.broker.failure.detection.enable`	`false`	Whether to use the Kafka API to detect broker failures instead of ZooKeeper. When enabled, `zookeeper.connect` does not need to be set.
	`broker.failure.detection.interval.ms`	`null`	The interval in millisecond that broker failure detector will run to detect broker failures. If this interval time is not specified, the broker failure detector will run with interval specified in `anomaly.detection.interval.ms`. This is only used when `kafka.broker.failure.detection.enable` is set to 'true'.
Goal violation	`goal.violations.class`	`com.linkedin.kafka.cruisecontrol.detector.GoalViolations`	The name of the class that extends goal violations.
	`anomaly.detection.goals`	For the list of available goals, see the Configuring goals section.	The goals that the anomaly detector should detect if they are violated.
	`goal.violation.detection.interval.ms`	value of `anomaly.detection.interval.ms`	The interval in millisecond that goal violation detector will run to detect goal violations. If this interval time is not specified, goal violation detector will run with interval specified in `anomaly.detection.interval.ms`.
Disk failure	`disk.failures.class`	`com.linkedin.kafka.cruisecontrol.detector.DiskFailures`	The name of the class that extends disk failures anomaly.
Disk failure	`disk.failure.detection.interval.ms`	value of `anomaly.detection.interval.ms`	The interval in millisecond that disk failure detector will run to detect disk failures. If this interval time is not specified, disk failure detector will run with interval specified in `anomaly.detection.interval.ms`.
Topic anomaly	`topic.anomaly.detection.interval.ms`	value of `anomaly.detection.interval.ms`	The interval in millisecond that topic anomaly detector will run to detect topic anomalies. If this interval time is not specified, topic anomaly detector will run with interval specified in `anomaly.detection.interval.ms`.
Topic anomaly	`topic.anomaly.finder.class`	`com.linkedin.kafka.cruisecontrol.detector.NoopTopicAnomalyFinder`	A list of topic anomaly finder classes to find the current state to identify topic anomalies.
Slow broker	`slow.broker.bytes.in.rate.detection.threshold`	`1024.0`	The bytes in rate threshold in units of kilobytes per second to determine whether to include brokers in slow broker detection.
	`slow.broker.log.flush.time.threshold.ms`	`1000.0`	The log flush time threshold in units of millisecond to determine whether to detect a broker as a slow broker.
	`slow.broker.metric.history.percentile.threshold`	`90.0`	The percentile threshold used to compare the latest metric value against historical value in slow broker detection.
	`slow.broker.metric.history.margin`	`3.0`	The margin used to compare the latest metric value against historical value in slow broker detection.
	`slow.broker.peer.metric.percentile.threshold`	`50.0`	The percentile threshold used to compare last metric value against peers' latest value in slow broker detection.
	`slow.broker.peer.metric.margin`	`10.0`	The margin used to compare last metric value against peers' latest value in slow broker detection.
	`slow.broker.demotion.score`	`5`	The score threshold to trigger a demotion for slow brokers.
	`slow.broker.decommission.score`	`50`	The score threshold to trigger a removal for slow brokers.
	`slow.broker.self.healing.unfixable.ratio`	`0.1`	The maximum ratio of slow brokers in the cluster to trigger self-healing operation.
Metric anomaly	`metric.anomaly.class`	`com.linkedin.kafka.cruisecontrol.detector.KafkaMetricAnomaly`	The name of class that extends metric anomaly.
Metric anomaly	`metric.anomaly.detection.interval.ms`	value of `anomaly.detection.interval.ms`	The interval in millisecond that metric anomaly detector will run to detect metric anomalies. If this interval time is not specified, the metric anomaly detector will run with the interval specified in `anomaly.detection.interval.ms`.
Maintenance event	`maintenance.event.reader.class`	`com.linkedin.kafka.cruisecontrol.detector.NoopMaintenanceEventReader`	A maintenance event reader class to retrieve maintenance events from the user-defined store.
	`maintenance.event.class`	`com.linkedin.kafka.cruisecontrol.detector.MaintenanceEvent`	The name of the class that extends the maintenance event.
	`maintenance.event.enable.idempotence`	`true`	The flag to indicate whether maintenance event detector will drop the duplicate maintenance events detected within the configured retention period.
	`maintenance.event.idempotence.retention.ms`	`180000`	The maximum time in ms to store events retrieved from the MaintenanceEventReader. Relevant only if idempotency is enabled (see `maintenance.event.enable.idempotence`).
	`maintenance.event.max.idempotence.cache.size`	`25`	The maximum number of maintenance events cached by the MaintenanceEventDetector within the past maintenance.event.idempotence.retention.ms ms. Relevant only if idempotency is enabled (see `maintenance.event.enable.idempotence`).
	`maintenance.event.stop.ongoing.execution`	`true`	The flag to indicate whether a maintenance event will gracefully stop the ongoing execution (if any) and wait until the execution stops before starting a fix for the anomaly.

For more information about the Anomaly detector configurations, see the Cruise Control documentation.

Click Save changes.
Click on Action > Restart next to the Cruise Control service name to restart Cruise Control.

Enabling self-healing using REST API🔗

Open a command line tool.
Use ssh and connect to your cluster running Cruise Control.
```
ssh root@<your_hostname>
```
You will be prompted to provide your password.
Enable self-healing for the required anomaly types using the following POST command:
```
POST /kafkacruisecontrol/admin?enable_self_healing_for=[anomaly_type]
```
The following parameters must be used for anomaly_type:
- GOAL_VIOLATION
- BROKER_FAILURE
- METRIC_ANOMALY
- DISK_FAILURE
- TOPIC_ANOMALY
note
In case you do not want to enable self-healing for certain anomaly types, you can disable them by using the following command:
```
POST /kafkacruisecontrol/admin?disable_self_healing_for=[anomaly_type]
```
Check which anomalies are currently in use, and which are detected with the following GET command:
```
GET /kafkacruisecontrol/state
```

When reviewing the state of Cruise Control, you can check the status of Anomaly Detector at the following parameters:

selfHealingEnabled - Anomaly type for which self-healing is enabled
selfHealingDisabled - Anomaly type for which self healing is disabled
recentGoalViolations - Recently detected goal violations
recentBrokerFailures - Recently detected broker failures
recentDiskFailures - Recently detected disk failures
recentMetricAnomalies - Recently detected metric anomalies

Enabling self-healing for all or individual anomaly types

Enabling self-healing in Cloudera Manager🔗

Enabling self-healing using REST API🔗

We want your opinion

How can we improve this page?