Orchestrate a rolling restart with no downtime
Kudu 1.12 provides tooling to restart a cluster with no downtime. This topic provides the steps to perform a rolling restart.
Cloudera Manager can automate this process through the “Rolling Restart” command on the Kudu service.
Cloudera Manager will prompt you to specify how many tablet servers to restart concurrently. If you are running with rack awareness and at least three racks are specified across all hosts that contain Kudu roles, it is safe to specify a restart batch of up to one rack at a time, provided the rack assignment policy is being enforced.
The following service configurations can be set to tune the parameters the rolling restart will run with:
- Rolling Restart Health Check Interval: the interval, in seconds, at which Cloudera Manager will run ksck after restarting a batch of tablet servers, while waiting for the cluster to become healthy.
- Maximum Allowed Runtime to Rolling Restart a Batch of Servers: the total amount of time in seconds Cloudera Manager will wait for the cluster to become healthy after restarting a batch of tablet servers, before exiting with an error.
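As a rough illustration of how these two settings interact, the following shell sketch polls a health check at a fixed interval and gives up once a maximum runtime has elapsed. It is not Cloudera Manager's implementation; the function name wait_until_healthy and the host name in the usage note are hypothetical.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; this is not how Cloudera Manager is implemented.

# wait_until_healthy INTERVAL_SECS MAX_RUNTIME_SECS CHECK_CMD [ARGS...]
# Runs CHECK_CMD every INTERVAL_SECS seconds until it exits with status 0.
# Returns 1 if MAX_RUNTIME_SECS elapse before a check succeeds.
wait_until_healthy() {
  local interval="$1" max_runtime="$2"
  shift 2
  local deadline
  deadline=$(( $(date +%s) + max_runtime ))

  until "$@"; do
    # The check failed; stop if the allowed runtime is used up.
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "cluster not healthy after ${max_runtime}s" >&2
      return 1
    fi
    sleep "$interval"
  done
}
```

For example, wait_until_healthy 30 1800 kudu cluster ksck master-1.example.com:7051 would rerun ksck every 30 seconds and fail after roughly 30 minutes without a healthy result.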
The rolling restart consists of the following steps:
- Restart the masters one at a time. If there is only a single master, this may briefly interfere with ongoing workloads.
- Starting with a single tablet server, put the tablet server into maintenance mode by using the kudu tserver state enter_maintenance tool.
- Start quiescing the tablet server using the kudu tserver quiesce start tool. This signals Kudu to stop hosting leaders on the specified tablet server and to redirect new scan requests to other tablet servers.
- Periodically run kudu tserver quiesce start with the --error_if_not_fully_quiesced option until it returns success, indicating that all leaders have been moved away from the tablet server and that all ongoing scans have completed.
- Restart the tablet server.
- Periodically run ksck until it reports that the cluster is healthy.
- Exit maintenance mode on the tablet server by running kudu tserver state exit_maintenance. This allows new tablet replicas to be placed on the tablet server.
- Repeat these steps for all tablet servers in the cluster.
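The tablet server steps above can be sketched as a shell function. This is a minimal sketch under stated assumptions, not a supported script: the argument order of each kudu command reflects the usual CLI usage and should be confirmed with kudu <command> --help on your version; restart_tserver is a hypothetical hook that you must replace with your own restart mechanism; and the retry loops have no timeout, so a production script would need to bound them.

```shell
#!/usr/bin/env bash
# Sketch of the per-tablet-server steps. Verify each command against
# `kudu <command> --help` before relying on it.

# Hypothetical hook: replace the body with whatever restarts the tablet
# server process in your deployment (systemd, init script, and so on).
restart_tserver() {
  echo "restart_tserver is not implemented (target: $1)" >&2
  return 1
}

# retry_until_success INTERVAL_SECS CMD [ARGS...]
# Reruns CMD every INTERVAL_SECS seconds until it exits with status 0.
retry_until_success() {
  local interval="$1"
  shift
  until "$@"; do
    sleep "$interval"
  done
}

# rolling_restart_tserver MASTERS TSERVER_UUID TSERVER_ADDR [INTERVAL_SECS]
rolling_restart_tserver() {
  local masters="$1" uuid="$2" addr="$3" interval="${4:-10}"

  # Enter maintenance mode for this tablet server.
  kudu tserver state enter_maintenance "$masters" "$uuid" || return 1

  # Ask the server to stop hosting leaders and to redirect new scans.
  kudu tserver quiesce start "$addr" || return 1

  # Poll until all leaders have moved away and ongoing scans have finished.
  retry_until_success "$interval" \
    kudu tserver quiesce start "$addr" --error_if_not_fully_quiesced

  restart_tserver "$addr" || return 1

  # Wait until ksck reports a healthy cluster.
  retry_until_success "$interval" kudu cluster ksck "$masters"

  # Allow new tablet replicas to be placed on this server again.
  kudu tserver state exit_maintenance "$masters" "$uuid"
}
```

Calling rolling_restart_tserver once per tablet server, one after another, covers the final "repeat" step. Keeping the restart behind a hook keeps the sketch independent of how the tablet server process is managed.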