Corruption: checksum error on CFile block
In versions prior to Kudu 1.8.0, if the data on disk becomes corrupt, you will encounter warnings containing "Corruption: checksum error on CFile block" in the tablet server logs and client side errors when trying to scan tablets with corrupt CFile blocks. Fixing this corruption is a manual process.
To fix the issue, first identify all the affected tablets by running a checksum scan on the affected tables or tablets using the ksck tool.
sudo -u kudu kudu cluster ksck <master_addresses> -checksum_scan -tables=<tables>
sudo -u kudu kudu cluster ksck <master_addresses> -checksum_scan -tablets=<tablets>
If there
is at least one replica for each tablet that does not return a corruption error, you can
repair the bad copies by deleting them and forcing them to be re-replicated from the leader
using the remote_replica
delete tool.
sudo -u kudu kudu remote_replica delete <tserver_address> <tablet_id> "Cfile Corruption"
If all of the replica are corrupt, then some data loss has occurred. Until KUDU-2526 is completed, this can happen if the corrupt replica became the leader and the existing follower replicas are replaced.
If data
has been lost, you can repair the table by replacing the corrupt tablet with an empty one
using the unsafe_replace_tablet
tool.
sudo -u kudu kudu tablet unsafe_replace_tablet <master_addresses> <tablet_id>
From versions 1.8.0 onwards, Kudu will mark the affected replicas as failed, leading to their automatic re-replication elsewhere.