Minimizing cluster disruption during temporary planned downtime of a single tablet server

If a single tablet server is brought down temporarily in a healthy cluster, all tablets remain available and clients continue to function normally, apart from possible brief delays caused by leader elections. However, if the downtime lasts longer than --follower_unavailable_considered_failed_sec seconds (default 300), the tablet replicas on the down tablet server are replaced by new replicas on the available tablet servers. This re-replication puts stress on the cluster and, if the downtime lasts long enough, significantly reduces the number of replicas on the down tablet server. Restoring an even distribution afterward may require running the rebalancer.

To work around this, increase --follower_unavailable_considered_failed_sec on all tablet servers so that re-replication does not start until after the expected downtime of the tablet server, including the time the tablet server takes to restart and bootstrap its tablet replicas. To do this, run the following command for each tablet server:

$ sudo -u kudu kudu tserver set_flag <tserver_address> follower_unavailable_considered_failed_sec <num_seconds>

where <num_seconds> is a number of seconds long enough to cover the entire downtime. Once the downtime is over, reset the flag to its original value:

$ sudo -u kudu kudu tserver set_flag <tserver_address> follower_unavailable_considered_failed_sec <original_value>

In Kudu versions 1.7 and earlier, the --force flag must also be provided in the above commands.
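
Because the flag has to be changed on every tablet server, the set_flag command above can be wrapped in a small loop. The following is a minimal sketch, not an official tool: the function name set_unavailable_timeout and the KUDU_CMD variable are hypothetical, and the tablet server addresses are supplied by the caller. KUDU_CMD exists only so the sketch can be dry-run (for example with KUDU_CMD=echo) before touching a cluster.

```shell
#!/usr/bin/env bash
# Sketch: apply one follower_unavailable_considered_failed_sec value to
# several tablet servers. The function name and the KUDU_CMD variable are
# hypothetical; only the "kudu tserver set_flag" call comes from the text above.

set_unavailable_timeout() {
  local seconds="$1"
  shift
  # Reject anything that is not a plain non-negative integer.
  case "$seconds" in
    ''|*[!0-9]*)
      echo "seconds must be an integer: '$seconds'" >&2
      return 1
      ;;
  esac
  local addr
  for addr in "$@"; do
    # KUDU_CMD is intentionally unquoted so that the default
    # "sudo -u kudu kudu" splits into separate words.
    # On Kudu 1.7 and earlier, add --force as noted above.
    ${KUDU_CMD:-sudo -u kudu kudu} tserver set_flag "$addr" \
      follower_unavailable_considered_failed_sec "$seconds" || return 1
  done
}
```

After sourcing the function, a call such as set_unavailable_timeout 1800 tserver-1.example.com:7050 tserver-2.example.com:7050 (hypothetical addresses) would raise the timeout to 30 minutes on both servers. Calling it again with the original value, 300 if the default was in use, resets the flag once the downtime is over.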
