Monitoring cluster health with ksck

The kudu CLI includes a tool called ksck that can be used for gathering information about the state of a Kudu cluster, including checking its health. ksck will identify issues such as under-replicated tablets, unreachable tablet servers, or tablets without a leader.

ksck should be run from the command line as the Kudu admin user, and requires the full list of master addresses to be specified:

$ sudo -u kudu kudu cluster ksck master-01.example.com,master-02.example.com,master-03.example.com

To see a full list of the options available with ksck, use the --help flag. If the cluster is healthy, ksck will print information about the cluster, a success message, and return a zero (success) exit status.

Master Summary
               UUID               |       Address         | Status
----------------------------------+-----------------------+---------
 a811c07b99394df799e6650e7310f282 | master-01.example.com | HEALTHY
 b579355eeeea446e998606bcb7e87844 | master-02.example.com | HEALTHY
 cfdcc8592711485fad32ec4eea4fbfcd | master-02.example.com | HEALTHY

Tablet Server Summary
               UUID               |        Address         | Status
----------------------------------+------------------------+---------
 a598f75345834133a39c6e51163245db | tserver-01.example.com | HEALTHY
 e05ca6b6573b4e1f9a518157c0c0c637 | tserver-02.example.com | HEALTHY
 e7e53a91fe704296b3a59ad304e7444a | tserver-03.example.com | HEALTHY

Version Summary
 Version |      Servers
---------+-------------------------
  1.7.1  | all 6 server(s) checked

Summary by table
   Name   | RF | Status  | Total Tablets | Healthy | Recovering | Under-replicated | Unavailable
----------+----+---------+---------------+---------+------------+------------------+-------------
 my_table | 3  | HEALTHY | 8             | 8       | 0          | 0                | 0

                | Total Count
----------------+-------------
 Masters        | 3
 Tablet Servers | 3
 Tables         | 1
 Tablets        | 8
 Replicas       | 24
OK

If the cluster is unhealthy, for instance if a tablet server process has stopped, ksck will report the issue(s) and return a non-zero exit status, as shown in the abbreviated snippet of ksck output below:

Tablet Server Summary
               UUID               |        Address         |   Status
----------------------------------+------------------------+-------------
 a598f75345834133a39c6e51163245db | tserver-01.example.com | HEALTHY
 e05ca6b6573b4e1f9a518157c0c0c637 | tserver-02.example.com | HEALTHY
 e7e53a91fe704296b3a59ad304e7444a | tserver-03.example.com | UNAVAILABLE
Error from 127.0.0.1:7150: Network error: could not get status from server: Client connection negotiation failed: client connection to 127.0.0.1:7150: connect: Connection refused (error 61) (UNAVAILABLE)

... (full output elided)

------------------
Errors:
------------------
Network error: error fetching info from tablet servers: failed to gather info for all tablet servers: 1 of 3 had errors
Corruption: table consistency check error: 1 out of 1 table(s) are not healthy

FAILED
Runtime error: ksck discovered errors

To verify data integrity, the optional --checksum_scan flag can be set, which will ensure the cluster has consistent data by scanning each tablet replica and comparing results. The --tables or --tablets flags can be used to limit the scope of the checksum scan to specific tables or tablets, respectively.

For example, checking data integrity on the my_table table can be done with the following command:

$ sudo -u kudu kudu cluster ksck --checksum_scan --tables my_table master-01.example.com,master-02.example.com,master-03.example.com

By default, ksck will attempt to use a snapshot scan of the table, so the checksum scan can be done while writes continue.

Finally, ksck also supports output in JSON format using the --ksck_format flag. JSON output contains the same information as the plain text output, but in a format that can be used by other tools. See kudu cluster ksck --help for more information.