Monitoring cluster health with ksck
The kudu CLI includes a tool called ksck that gathers information about the state of a Kudu cluster, including its health. ksck identifies issues such as under-replicated tablets, unreachable tablet servers, and tablets without a leader.
Run ksck from the command line as the Kudu admin user, and specify the full list of master addresses:
$ sudo -u kudu kudu cluster ksck master-01.example.com,master-02.example.com,master-03.example.com
To see the full list of options available with ksck, use the --help flag. If the cluster is healthy, ksck prints information about the cluster and a success message, and returns a zero (success) exit status.
Master Summary
               UUID               |        Address        | Status
----------------------------------+-----------------------+---------
 a811c07b99394df799e6650e7310f282 | master-01.example.com | HEALTHY
 b579355eeeea446e998606bcb7e87844 | master-02.example.com | HEALTHY
 cfdcc8592711485fad32ec4eea4fbfcd | master-03.example.com | HEALTHY
Tablet Server Summary
               UUID               |        Address         | Status
----------------------------------+------------------------+---------
 a598f75345834133a39c6e51163245db | tserver-01.example.com | HEALTHY
 e05ca6b6573b4e1f9a518157c0c0c637 | tserver-02.example.com | HEALTHY
 e7e53a91fe704296b3a59ad304e7444a | tserver-03.example.com | HEALTHY
Version Summary
 Version |         Servers
---------+-------------------------
 1.7.1   | all 6 server(s) checked
Summary by table
   Name   | RF | Status  | Total Tablets | Healthy | Recovering | Under-replicated | Unavailable
----------+----+---------+---------------+---------+------------+------------------+-------------
 my_table | 3  | HEALTHY | 8             | 8       | 0          | 0                | 0
                | Total Count
----------------+-------------
 Masters        | 3
 Tablet Servers | 3
 Tables         | 1
 Tablets        | 8
 Replicas       | 24
OK
If the cluster is unhealthy, for instance because a tablet server process has stopped, ksck reports the issues and returns a non-zero exit status, as shown in the abbreviated ksck output below:
Tablet Server Summary
               UUID               |        Address         |   Status
----------------------------------+------------------------+-------------
 a598f75345834133a39c6e51163245db | tserver-01.example.com | HEALTHY
 e05ca6b6573b4e1f9a518157c0c0c637 | tserver-02.example.com | HEALTHY
 e7e53a91fe704296b3a59ad304e7444a | tserver-03.example.com | UNAVAILABLE
Error from 127.0.0.1:7150: Network error: could not get status from server: Client connection negotiation failed: client connection to 127.0.0.1:7150: connect: Connection refused (error 61) (UNAVAILABLE)
... (full output elided)
------------------
Errors:
------------------
Network error: error fetching info from tablet servers: failed to gather info for all tablet servers: 1 of 3 had errors
Corruption: table consistency check error: 1 out of 1 table(s) are not healthy
FAILED
Runtime error: ksck discovered errors
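Because ksck signals cluster health through its exit status, it can be wrapped for use from cron or a monitoring agent. The following is a minimal sketch, not an official Kudu script: check_cluster_health is a hypothetical helper name, and the sketch assumes it runs as the Kudu admin user with kudu on the PATH.

```shell
#!/bin/sh
# Sketch only: check_cluster_health is a hypothetical helper, not part of Kudu.
# Assumes the caller is already the Kudu admin user and `kudu` is on the PATH.

check_cluster_health() {
    masters="$1"   # comma-separated list of all master addresses
    rc=0
    # Capture the full ksck report; remember a non-zero exit status if any.
    report=$(kudu cluster ksck "$masters" 2>&1) || rc=$?
    if [ "$rc" -eq 0 ]; then
        echo "ksck: cluster healthy"
    else
        # Send the summary line and the captured report to standard error.
        echo "ksck: problems found (exit status $rc)" >&2
        printf '%s\n' "$report" >&2
    fi
    return "$rc"
}
```

Calling check_cluster_health master-01.example.com,master-02.example.com,master-03.example.com from a scheduled job lets the scheduler treat a non-zero return value as a failed health check, while the captured report remains available on standard error for diagnosis.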
To verify data integrity, set the optional --checksum_scan flag. ksck then scans each tablet replica and compares the results to check that the replicas hold consistent data. Use the --tables or --tablets flag to limit the checksum scan to specific tables or tablets, respectively.
For example, the following command checks data integrity on the my_table table:
$ sudo -u kudu kudu cluster ksck --checksum_scan --tables my_table master-01.example.com,master-02.example.com,master-03.example.com
By default, ksck attempts to use a snapshot scan of the table, so the checksum scan can run while writes continue.
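When several tables need a checksum scan, the scope can be passed in a single ksck invocation. The sketch below assumes that --tables accepts a comma-separated list of table names (confirm with kudu cluster ksck --help on your version); checksum_tables is a hypothetical helper name, and the sketch assumes it runs as the Kudu admin user.

```shell
#!/bin/sh
# Sketch only: checksum_tables is a hypothetical helper, not part of Kudu.
# Usage: checksum_tables MASTERS TABLE [TABLE...]

checksum_tables() {
    masters="$1"   # comma-separated list of all master addresses
    shift
    # Join the remaining arguments with commas: the list format
    # assumed here for the --tables flag.
    tables=$(IFS=,; echo "$*")
    kudu cluster ksck --checksum_scan --tables "$tables" "$masters"
}
```

The function returns the exit status of ksck itself, so a caller can test it in the same way as a plain health check.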
Finally, ksck can also produce output in JSON format when the --ksck_format flag is set. JSON output contains the same information as the plain text output, but in a format that other tools can consume. See kudu cluster ksck --help for more information.
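As an illustration, the JSON report can be written to a file for another tool to read. This is a sketch under stated assumptions: json_compact is assumed to be one of the values accepted by --ksck_format (check the --help output for the values your version supports), the JSON document is assumed to be the only content written to standard output, and save_ksck_json is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: save_ksck_json is a hypothetical helper, not part of Kudu.
# Usage: save_ksck_json MASTERS OUTPUT_FILE

save_ksck_json() {
    masters="$1"   # comma-separated list of all master addresses
    outfile="$2"   # file that receives the JSON report
    rc=0
    # json_compact is an assumed format value; verify it with --help.
    kudu cluster ksck --ksck_format=json_compact "$masters" > "$outfile" || rc=$?
    # Preserve the ksck exit status so callers can still detect an unhealthy cluster.
    return "$rc"
}
```

Keeping the exit status separate from the saved report means a consumer can decide whether to parse the file at all before looking at its contents.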