Enabling self-healing for all or individual anomaly types

Self-healing is disabled for Cruise Control by default. You can enable it in Cloudera Manager through the cruisecontrol.properties configuration, or by sending a curl POST request that specifies the anomaly type.

Enabling self-healing in Cloudera Manager

  1. Go to your cluster in Cloudera Manager.
  2. Select Cruise Control from the list of Services.
  3. Click the Configuration tab.
  4. Search for the Cruise Control Server Advanced Configuration Snippet (Safety Valve) for cruisecontrol.properties setting.
  5. Choose to enable self-healing for all or only specific anomaly types, and add the corresponding parameter to the Safety Valve field based on your requirements.
    1. To enable self-healing for all anomaly types, add the self.healing.enabled=true configuration parameter to the Safety Valve.
    2. To enable self-healing for specific anomaly types, add the corresponding configuration parameter to the Safety Valve:
      • self.healing.broker.failure.enabled=true
      • self.healing.goal.violation.enabled=true
      • self.healing.disk.failure.enabled=true
      • self.healing.topic.anomaly.enabled=true
      • self.healing.slow.broker.removal.enabled=true
      • self.healing.metric.anomaly.enabled=true
      • self.healing.maintenance.event.enabled=true
  6. Provide additional configuration for self-healing.
    There are additional configurations that you can use to further customize the self-healing process.
    Configuration | Value | Description
    anomaly.notifier.class | com.linkedin.kafka.cruisecontrol.detector.notifier.SelfHealingNotifier | The notifier class that triggers an alert when an anomaly is detected. The notifier class must be configured to enable self-healing. For more information, see Enabling self-healing in Cruise Control.
    broker.failure.alert.threshold.ms | 900000 | Defines the threshold to mark a broker as dead. If a non-empty broker leaves the cluster at time T and does not rejoin the cluster before T + broker.failure.alert.threshold.ms, the broker is considered dead as of T, and an alert is triggered.
    broker.failure.self.healing.threshold.ms | 1800000 | If self-healing is enabled and a broker is dead at T, self-healing is triggered at T + broker.failure.self.healing.threshold.ms.
    For more information about the self-healing configurations, see the Cruise Control documentation.
  7. Provide additional configuration for the anomaly types.
    There are additional configurations that you can provide for the anomaly types.
    Anomaly type | Configuration | Value | Description
    Broker failure | broker.failures.class | com.linkedin.kafka.cruisecontrol.detector.BrokerFailures | The name of the class that extends broker failures.
    Broker failure | failed.brokers.file.path | fileStore/failedBrokers.txt | The file path to store the failed broker list. This persists the broker failure time in case Cruise Control fails and restarts while some brokers are down.
    Broker failure | fixable.failed.broker.count.threshold | 10 | The upper boundary of the number of concurrently failed brokers that is considered fixable. If too many brokers are failing at the same time, it is often due to something more fundamental going wrong, and removing replicas from failed brokers cannot alleviate the situation.
    Broker failure | fixable.failed.broker.percentage.threshold | 0.4 | The upper boundary of the percentage of concurrently failed brokers that is considered fixable. If a large portion of brokers are failing at the same time, it is often due to something more fundamental going wrong, and removing replicas from failed brokers cannot alleviate the situation.
    Broker failure | broker.failure.detection.backoff.ms | 300000 | The backoff time in milliseconds before the broker failure detector triggers another broker failure detection if the currently detected broker failure is not ready to fix.
    Broker failure | kafka.broker.failure.detection.enable | false | Whether to use the Kafka API to detect broker failures instead of ZooKeeper. When enabled, zookeeper.connect does not need to be set.
    Broker failure | broker.failure.detection.interval.ms | null | The interval in milliseconds at which the broker failure detector runs to detect broker failures. If this interval is not specified, the broker failure detector runs with the interval specified in anomaly.detection.interval.ms. This is only used when kafka.broker.failure.detection.enable is set to 'true'.
    Goal violation | goal.violations.class | com.linkedin.kafka.cruisecontrol.detector.GoalViolations | The name of the class that extends goal violations.
    Goal violation | anomaly.detection.goals | For the list of available goals, see the Configuring goals section. | The goals that the anomaly detector checks for violations.
    Goal violation | goal.violation.detection.interval.ms | Value of anomaly.detection.interval.ms | The interval in milliseconds at which the goal violation detector runs to detect goal violations. If this interval is not specified, the goal violation detector runs with the interval specified in anomaly.detection.interval.ms.
    Disk failure | disk.failures.class | com.linkedin.kafka.cruisecontrol.detector.DiskFailures | The name of the class that extends the disk failures anomaly.
    Disk failure | disk.failure.detection.interval.ms | Value of anomaly.detection.interval.ms | The interval in milliseconds at which the disk failure detector runs to detect disk failures. If this interval is not specified, the disk failure detector runs with the interval specified in anomaly.detection.interval.ms.
    Topic anomaly | topic.anomaly.detection.interval.ms | Value of anomaly.detection.interval.ms | The interval in milliseconds at which the topic anomaly detector runs to detect topic anomalies. If this interval is not specified, the topic anomaly detector runs with the interval specified in anomaly.detection.interval.ms.
    Topic anomaly | topic.anomaly.finder.class | com.linkedin.kafka.cruisecontrol.detector.NoopTopicAnomalyFinder | A list of topic anomaly finder classes to find the current state to identify topic anomalies.
    Slow broker | slow.broker.bytes.in.rate.detection.threshold | 1024.0 | The bytes-in rate threshold, in kilobytes per second, used to determine whether to include a broker in slow broker detection.
    Slow broker | slow.broker.log.flush.time.threshold.ms | 1000.0 | The log flush time threshold, in milliseconds, used to determine whether to detect a broker as a slow broker.
    Slow broker | slow.broker.metric.history.percentile.threshold | 90.0 | The percentile threshold used to compare the latest metric value against the historical value in slow broker detection.
    Slow broker | slow.broker.metric.history.margin | 3.0 | The margin used to compare the latest metric value against the historical value in slow broker detection.
    Slow broker | slow.broker.peer.metric.percentile.threshold | 50.0 | The percentile threshold used to compare the last metric value against the peers' latest value in slow broker detection.
    Slow broker | slow.broker.peer.metric.margin | 10.0 | The margin used to compare the last metric value against the peers' latest value in slow broker detection.
    Slow broker | slow.broker.demotion.score | 5 | The score threshold to trigger a demotion for slow brokers.
    Slow broker | slow.broker.decommission.score | 50 | The score threshold to trigger a removal for slow brokers.
    Slow broker | slow.broker.self.healing.unfixable.ratio | 0.1 | The maximum ratio of slow brokers in the cluster to trigger a self-healing operation.
    Metric anomaly | metric.anomaly.class | com.linkedin.kafka.cruisecontrol.detector.KafkaMetricAnomaly | The name of the class that extends the metric anomaly.
    Metric anomaly | metric.anomaly.detection.interval.ms | Value of anomaly.detection.interval.ms | The interval in milliseconds at which the metric anomaly detector runs to detect metric anomalies. If this interval is not specified, the metric anomaly detector runs with the interval specified in anomaly.detection.interval.ms.
    Maintenance event | maintenance.event.reader.class | com.linkedin.kafka.cruisecontrol.detector.NoopMaintenanceEventReader | A maintenance event reader class to retrieve maintenance events from the user-defined store.
    Maintenance event | maintenance.event.class | com.linkedin.kafka.cruisecontrol.detector.MaintenanceEvent | The name of the class that extends the maintenance event.
    Maintenance event | maintenance.event.enable.idempotence | true | The flag to indicate whether the maintenance event detector drops duplicate maintenance events detected within the configured retention period.
    Maintenance event | maintenance.event.idempotence.retention.ms | 180000 | The maximum time in milliseconds to store events retrieved from the MaintenanceEventReader. Relevant only if idempotency is enabled (see maintenance.event.enable.idempotence).
    Maintenance event | maintenance.event.max.idempotence.cache.size | 25 | The maximum number of maintenance events cached by the MaintenanceEventDetector within the past maintenance.event.idempotence.retention.ms milliseconds. Relevant only if idempotency is enabled (see maintenance.event.enable.idempotence).
    Maintenance event | maintenance.event.stop.ongoing.execution | true | The flag to indicate whether a maintenance event gracefully stops the ongoing execution (if any) and waits until the execution stops before starting a fix for the anomaly.
    For more information about the anomaly detector configurations, see the Cruise Control documentation.
  8. Click Save changes.
  9. Click Action > Restart next to the Cruise Control service name to restart Cruise Control.
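
As an illustration of steps 5 and 6, the following Python sketch assembles the key=value lines that could be pasted into the Safety Valve field, and works out when an alert and self-healing would fire for a failed broker. The property names, the notifier class, and the default thresholds are taken from the procedure above. The function names (safety_valve_lines, broker_failure_timeline) and the short dictionary keys are hypothetical helpers introduced only for this example; they are not part of Cruise Control or Cloudera Manager.

```python
# Property names copied from step 5; the short keys on the left are
# hypothetical labels used only by this sketch.
ANOMALY_PROPERTIES = {
    "broker_failure": "self.healing.broker.failure.enabled",
    "goal_violation": "self.healing.goal.violation.enabled",
    "disk_failure": "self.healing.disk.failure.enabled",
    "topic_anomaly": "self.healing.topic.anomaly.enabled",
    "slow_broker_removal": "self.healing.slow.broker.removal.enabled",
    "metric_anomaly": "self.healing.metric.anomaly.enabled",
    "maintenance_event": "self.healing.maintenance.event.enabled",
}

# Notifier class listed in step 6; it must be configured for self-healing.
NOTIFIER_CLASS = (
    "com.linkedin.kafka.cruisecontrol.detector.notifier.SelfHealingNotifier"
)


def safety_valve_lines(anomaly_types, extra=None):
    """Return key=value lines for the selected anomaly types."""
    lines = [f"anomaly.notifier.class={NOTIFIER_CLASS}"]
    for name in anomaly_types:
        if name not in ANOMALY_PROPERTIES:
            raise ValueError(f"unknown anomaly type: {name}")
        lines.append(f"{ANOMALY_PROPERTIES[name]}=true")
    # Optional extra settings, for example thresholds from step 6 or 7.
    for key, value in (extra or {}).items():
        lines.append(f"{key}={value}")
    return lines


def broker_failure_timeline(
    t_ms, alert_threshold_ms=900_000, self_healing_threshold_ms=1_800_000
):
    """Return (alert time, self-healing time) for a broker that left at t_ms."""
    return t_ms + alert_threshold_ms, t_ms + self_healing_threshold_ms


if __name__ == "__main__":
    snippet = safety_valve_lines(
        ["broker_failure", "goal_violation"],
        extra={"broker.failure.alert.threshold.ms": 900000},
    )
    print("\n".join(snippet))

    alert_at, heal_at = broker_failure_timeline(0)
    # With the default thresholds: 15 and 30 minutes after the broker left.
    print(alert_at // 60_000, heal_at // 60_000)
```

With the default thresholds, a broker that leaves the cluster at time T and stays away is reported after 15 minutes and becomes eligible for self-healing after 30 minutes.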

Enabling self-healing using the REST API

  1. Open a command line tool.
  2. Use ssh to connect to the cluster host running Cruise Control.
    ssh root@<your_hostname>

    You will be prompted to provide your password.

  3. Enable self-healing for the required anomaly types using the following POST command:
    POST /kafkacruisecontrol/admin?enable_self_healing_for=[anomaly_type]
    Use one of the following values for anomaly_type:
    • GOAL_VIOLATION
    • BROKER_FAILURE
    • METRIC_ANOMALY
    • DISK_FAILURE
    • TOPIC_ANOMALY
  4. Check which anomaly types have self-healing enabled, and which anomalies were recently detected, with the following GET command:
    GET /kafkacruisecontrol/state
When reviewing the state of Cruise Control, you can check the status of the Anomaly Detector through the following fields:
  • selfHealingEnabled - Anomaly types for which self-healing is enabled
  • selfHealingDisabled - Anomaly types for which self-healing is disabled
  • recentGoalViolations - Recently detected goal violations
  • recentBrokerFailures - Recently detected broker failures
  • recentDiskFailures - Recently detected disk failures
  • recentMetricAnomalies - Recently detected metric anomalies
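
The two REST calls above can also be issued from a script. The following Python sketch builds the POST and GET requests with the standard library and summarizes the Anomaly Detector fields listed above. Several details are assumptions, not taken from this procedure: the base URL (host and port) is a placeholder, the json=true query parameter and the top-level AnomalyDetectorState key reflect how the upstream Cruise Control state response is commonly structured, and the function names are hypothetical. The sketch also ignores authentication and TLS, which a secured cluster would require.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Values listed in step 3 of the REST procedure.
ANOMALY_TYPES = {
    "GOAL_VIOLATION",
    "BROKER_FAILURE",
    "METRIC_ANOMALY",
    "DISK_FAILURE",
    "TOPIC_ANOMALY",
}

RECENT_FIELDS = (
    "recentGoalViolations",
    "recentBrokerFailures",
    "recentDiskFailures",
    "recentMetricAnomalies",
)


def enable_self_healing_request(base_url, anomaly_type):
    """Build (but do not send) the POST request that enables self-healing."""
    if anomaly_type not in ANOMALY_TYPES:
        raise ValueError(f"unsupported anomaly type: {anomaly_type}")
    query = urlencode({"enable_self_healing_for": anomaly_type})
    return Request(f"{base_url}/kafkacruisecontrol/admin?{query}", method="POST")


def state_request(base_url):
    """Build the GET request for the Cruise Control state."""
    # Assumption: json=true asks the server for a JSON response.
    return Request(f"{base_url}/kafkacruisecontrol/state?json=true", method="GET")


def fetch_state(base_url, timeout=10):
    """Send the state request and decode the JSON body."""
    with urlopen(state_request(base_url), timeout=timeout) as response:
        return json.load(response)


def summarize_anomaly_detector(state):
    """Extract the self-healing fields from a decoded state response."""
    # Assumption: the fields sit under "AnomalyDetectorState"; fall back to
    # the top level if that key is absent.
    detector = state.get("AnomalyDetectorState", state)
    return {
        "selfHealingEnabled": list(detector.get("selfHealingEnabled", [])),
        "selfHealingDisabled": list(detector.get("selfHealingDisabled", [])),
        "recentCounts": {
            field: len(detector.get(field, [])) for field in RECENT_FIELDS
        },
    }


if __name__ == "__main__":
    base = "http://cruise-control.example.com:8899"  # placeholder host and port
    request = enable_self_healing_request(base, "BROKER_FAILURE")
    # Only prints what would be sent; pass the request to urlopen to send it.
    print(request.get_method(), request.full_url)
```

Keeping request construction separate from sending makes it possible to review the exact URL before anything changes on the cluster; call urlopen on the built request, or fetch_state, only once the host, port, and authentication are confirmed for your environment.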