Recover from disk failure
Kudu nodes can only survive failures of disks on which certain Kudu directories are mounted. For more information about the different Kudu directory types, see the Directory configuration topic.
|Node Type||Kudu directory type||Kudu releases that crash on disk failure|
|Tablet Server||Directory containing WALs||All|
|Tablet Server||Directory containing tablet metadata||All|
|Tablet Server||Directory containing data blocks only||Pre-1.6.0|
When a disk failure occurs that does not lead to a crash, Kudu stops using the affected directory, shuts down tablets with blocks in that directory, and automatically re-replicates the affected tablets to other tablet servers. The affected server remains alive and prints messages to the log indicating the disk failure, for example:
E1205 19:06:24.163748 27115 data_dirs.cc:1011] Directory /data/8/kudu/data marked as failed
E1205 19:06:30.324795 27064 log_block_manager.cc:1822] Not using report from /data/8/kudu/data: IO error: Could not open container 0a6283cab82d4e75848f49772d2638fe: /data/8/kudu/data/0a6283cab82d4e75848f49772d2638fe.metadata: Read-only file system (error 30)
E1205 19:06:33.564638 27220 ts_tablet_manager.cc:946] T 4957808439314e0d97795c1394348d80 P 70f7ee61ead54b1885d819f354eb3405: aborting tablet bootstrap: tablet has data in a failed directory
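When many log files have to be checked, the failure message can be searched for mechanically. The following is a minimal sketch that assumes the "Directory <path> marked as failed" message format shown above and a path without spaces; failed_dirs is a hypothetical helper name, not a Kudu tool.

```shell
# Print the unique data directories that a tablet server log reports as failed.
# Assumption: the message format matches the example above, i.e.
#   "... data_dirs.cc:NNNN] Directory <path> marked as failed"
# failed_dirs is a hypothetical helper, not part of Kudu.
failed_dirs() {
  # Reads log text from the files given as arguments, or from stdin if none.
  sed -n 's/.*] Directory \([^ ]*\) marked as failed.*/\1/p' "$@" | sort -u
}
```

Pass one or more tablet server log files as arguments, or pipe log text into the function. The location of the log files depends on the deployment.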
While in this state, the affected node avoids using the failed disk, which leads to lower storage capacity and reduced read parallelism. The administrator can remove the failed directory from the --fs_data_dirs gflag to avoid seeing these errors.
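Because --fs_data_dirs takes a comma-separated list of directories, the new value can be derived from the old one. The sketch below only rewrites the string; remove_data_dir is a hypothetical helper name, and how the resulting value is applied (flag file, management console, and so on) depends on the deployment.

```shell
# Remove one directory from a comma-separated --fs_data_dirs value.
# remove_data_dir is a hypothetical helper: it prints the new list and
# does not modify any Kudu configuration.
#   $1: current comma-separated list of data directories
#   $2: directory to remove (must match an entry exactly)
remove_data_dir() {
  printf '%s\n' "$1" | tr ',' '\n' | grep -vxF -- "$2" | paste -sd ',' -
}
```

For example, calling it with the list /data/1/kudu/data,/data/8/kudu/data and the failed directory /data/8/kudu/data prints /data/1/kudu/data.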
When the disk is repaired, remounted, and ready to be reused by Kudu, take the following steps:
- Make sure that the Kudu portion of the disk is completely empty.
- Stop the tablet server.
- Update the --fs_data_dirs gflag to add the repaired directory, for example /data/3.
- Start the tablet server.
- Run ksck to verify cluster health. For example:
$ sudo -u kudu kudu cluster ksck master-01.example.com
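The first step in the list above, confirming that the Kudu portion of the disk is empty, can be checked with a small script before the tablet server is restarted. This is a sketch only: dir_is_empty is a hypothetical helper name, and the directory passed to it is whatever path the repaired disk uses in your deployment.

```shell
# Succeed (exit status 0) only if the given path is an existing directory
# with no entries at all, hidden files included.
# dir_is_empty is a hypothetical helper, not a Kudu tool.
dir_is_empty() {
  [ -d "$1" ] && [ -z "$(ls -A -- "$1")" ]
}
```

A non-zero exit status means the path is missing, is not a directory, or still contains files that must be removed before Kudu reuses it.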