Handling disk failures

An overview of how to handle disk failures.

Cloudera Manager has built-in monitoring features that automatically trigger alerts when disk failures are detected. When a log directory fails, Kafka also detects the failure and takes the partitions stored in that directory offline. To analyze the cause of a disk failure, use the kafka-log-dirs tool or review the KafkaStorageException error messages in the Kafka broker log file. To access the log file, go to Instances > Log Files > Role Log File.
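
As an illustration of the kafka-log-dirs analysis described above, the following minimal sketch extracts failed log directories from the tool's output. It assumes the output of `kafka-log-dirs --describe --bootstrap-server <host:port>` ends with a single-line JSON document containing `brokers`, `logDirs`, `logDir`, and `error` fields, and that `error` is null for a healthy directory. The sample output, paths, and broker ID are hypothetical; verify the format against your Kafka version.

```python
import json


def failed_log_dirs(tool_output):
    """Return (broker_id, log_dir, error) for each log directory that reports an error."""
    # Assumption: the tool prints a few status lines, then one JSON line.
    json_line = next(
        (line for line in reversed(tool_output.splitlines())
         if line.lstrip().startswith("{")),
        None,
    )
    if json_line is None:
        return []

    report = json.loads(json_line)
    failures = []
    for broker in report.get("brokers", []):
        for log_dir in broker.get("logDirs", []):
            # Assumption: "error" is null (None) when the directory is healthy.
            if log_dir.get("error"):
                failures.append(
                    (broker["broker"], log_dir["logDir"], log_dir["error"])
                )
    return failures


# Hypothetical sample output; the paths and broker ID are made up for illustration.
SAMPLE_OUTPUT = "\n".join([
    "Querying brokers for log directories information",
    "Received log directory information from brokers 1",
    json.dumps({
        "version": 1,
        "brokers": [{
            "broker": 1,
            "logDirs": [
                {"logDir": "/data/1/kafka", "error": None, "partitions": []},
                {"logDir": "/data/2/kafka",
                 "error": "org.apache.kafka.common.errors.KafkaStorageException",
                 "partitions": []},
            ],
        }],
    }),
])

for broker_id, path, error in failed_log_dirs(SAMPLE_OUTPUT):
    print(f"broker {broker_id}: {path} -> {error}")
```

A directory listed by this sketch is a candidate for the replacement or removal actions; the corresponding KafkaStorageException entries in the broker log usually give more detail on the cause.
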
In case of a disk failure, a Kafka administrator can take one of the following actions, depending on the failure type and the system environment:
  • Replace the faulty disk with a new one.
  • Remove the disk and redistribute data across remaining disks to restore the desired replication factor.
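
After either action, it can be useful to confirm that the desired replication factor has actually been restored. The following is a minimal sketch of such a check, not a step prescribed by this procedure. It assumes the partition lines printed by `kafka-topics --describe` have the form `Topic: <name> Partition: <n> Leader: <id> Replicas: <ids> Isr: <ids>`; the topic name and broker IDs in the sample are hypothetical.

```python
import re

# Assumption: each partition line of `kafka-topics --describe` looks like
# "Topic: t  Partition: 0  Leader: 1  Replicas: 1,2,3  Isr: 1,2".
PARTITION_LINE = re.compile(
    r"Topic:\s*(\S+)\s+Partition:\s*(\d+)\s+Leader:\s*\S+"
    r"\s+Replicas:\s*([\d,]+)\s+Isr:\s*([\d,]*)"
)


def under_replicated(describe_output):
    """Return (topic, partition, missing_broker_ids) for partitions whose
    in-sync replica set is smaller than the assigned replica set."""
    result = []
    for match in PARTITION_LINE.finditer(describe_output):
        topic, partition, replicas, isr = match.groups()
        replica_ids = {int(b) for b in replicas.split(",") if b}
        isr_ids = {int(b) for b in isr.split(",") if b}
        missing = sorted(replica_ids - isr_ids)
        if missing:
            result.append((topic, int(partition), missing))
    return result


# Hypothetical sample output for a topic named "orders".
SAMPLE_DESCRIBE = (
    "Topic: orders\tPartitionCount: 2\tReplicationFactor: 3\tConfigs: \n"
    "\tTopic: orders\tPartition: 0\tLeader: 1\tReplicas: 1,2,3\tIsr: 1,2,3\n"
    "\tTopic: orders\tPartition: 1\tLeader: 2\tReplicas: 2,3,1\tIsr: 2,1\n"
)

for topic, partition, missing in under_replicated(SAMPLE_DESCRIBE):
    print(f"{topic}-{partition} is missing in-sync replicas on brokers {missing}")
```

An empty result suggests that all assigned replicas are back in sync. A non-empty result lists the brokers that are still catching up or are offline.
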