Orchestrate a rolling restart with no downtime

Kudu 1.12 provides tooling to restart a cluster with no downtime. This topic describes the steps to perform a rolling restart.

Cloudera Manager can automate this process by using the “Rolling Restart” command on the Kudu service.

Cloudera Manager prompts you to specify how many tablet servers to restart concurrently. If the cluster uses rack awareness and at least three racks are specified across all hosts that contain Kudu roles, it is safe to restart tablet servers in batches of up to one rack at a time, provided the rack assignment policy is being enforced.

The following service configurations can be set to tune the parameters the rolling restart will run with:
  • Rolling Restart Health Check Interval: the interval in seconds at which Cloudera Manager runs ksck after restarting a batch of tablet servers, while waiting for the cluster to become healthy.
  • Maximum Allowed Runtime to Rolling Restart a Batch of Servers: the total amount of time in seconds Cloudera Manager waits for the cluster to become healthy after restarting a batch of tablet servers before exiting with an error.
To perform a rolling restart manually, complete the following steps:
  1. Restart the master(s) one at a time. If there is only a single master, this may cause brief interference with on-going workloads.
  2. Starting with a single tablet server, put the tablet server into maintenance mode by using the kudu tserver state enter_maintenance tool.
  3. Start quiescing the tablet server using the kudu tserver quiesce start tool. This signals Kudu to stop hosting leaders on the specified tablet server and to redirect new scan requests to other tablet servers.
  4. Periodically run kudu tserver quiesce start with the --error_if_not_fully_quiesced option until it returns success, indicating that all leaders have been moved away from the tablet server and that all on-going scans have completed.
  5. Restart the tablet server.
  6. Periodically run ksck until the cluster reports a healthy status.
  7. Exit maintenance mode on the tablet server by running kudu tserver state exit_maintenance. This allows new tablet replicas to be placed on the tablet server.
  8. Repeat these steps for all tablet servers in the cluster.
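The per-tablet-server portion of the procedure (steps 2 through 7) can be sketched as a shell script. This is a minimal illustration, not a production script: the master addresses, tablet server UUID and address, sleep intervals, and the restart mechanism are all placeholders you must replace with your cluster's values.

```shell
#!/bin/sh
set -e

# Assumptions (replace with your own values):
MASTERS="master1:7051,master2:7051,master3:7051"  # master RPC addresses
TSERVER_UUID="<tserver-uuid>"                     # e.g. from `kudu tserver list`
TSERVER_ADDR="<tserver-host>:7050"                # tablet server RPC address

# Step 2: enter maintenance mode so the restart does not trigger re-replication.
kudu tserver state enter_maintenance "$MASTERS" "$TSERVER_UUID"

# Step 3: begin quiescing, moving leaders and new scans off this server.
kudu tserver quiesce start "$TSERVER_ADDR"

# Step 4: poll until the server is fully quiesced.
until kudu tserver quiesce start "$TSERVER_ADDR" --error_if_not_fully_quiesced; do
  sleep 5
done

# Step 5: restart the tablet server process. This is site-specific; for
# example, with systemd it might be:
#   systemctl restart kudu-tserver

# Step 6: poll ksck until the cluster reports a healthy status.
until kudu cluster ksck "$MASTERS"; do
  sleep 10
done

# Step 7: exit maintenance mode so new replicas may be placed here again.
kudu tserver state exit_maintenance "$MASTERS" "$TSERVER_UUID"
```

Repeating this script once per tablet server, one server at a time, completes step 8.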