Common Kudu workflows
Some common Kudu administrative tasks include migrating to multiple Kudu masters,
recovering from a dead Kudu master, removing unwanted masters from a multi-master deployment,
adding or updating hostnames of the masters within clusters set up without aliases, monitoring
the health of the cluster using ksck, changing directory configuration, recovering from disk
failures, bringing a tablet that has lost a majority of replicas back online, rebuilding a
Kudu filesystem layout, taking physical backups of an entire node, and scaling the storage for
the Kudu masters and the tablet servers in the cloud.
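Several of these workflows are driven from the kudu command-line tool. As a minimal sketch (the master addresses below are hypothetical), checking cluster health looks like this:

```shell
# Check the health of the cluster, passing the full list of master
# addresses (hypothetical hostnames).
kudu cluster ksck master-1.example.com:7051,master-2.example.com:7051,master-3.example.com:7051

# Optionally also verify data integrity with a checksum scan.
kudu cluster ksck --checksum_scan master-1.example.com:7051,master-2.example.com:7051,master-3.example.com:7051
```

A healthy cluster exits with status 0; issues such as under-replicated tablets are reported in the output.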
Migrating to multiple Kudu masters

To provide high availability and to avoid a single point of failure, Kudu clusters should be created with multiple masters. Many Kudu clusters were created with just a single master, either for simplicity or because Kudu multi-master support was still experimental at the time. This workflow demonstrates how to migrate to a multi-master configuration. It can also be used to migrate from two masters to three with straightforward modifications.

Recovering from a dead Kudu master in a multi-master deployment

Kudu multi-master deployments function normally in the event of a master loss. However, it is important to replace the dead master. Otherwise a second failure may lead to a loss of availability, depending on the number of available masters. This workflow describes how to replace the dead master.

Removing Kudu masters from a multi-master deployment

In the event that a multi-master deployment has been overallocated nodes, the following steps should be taken to remove the unwanted masters.

Changing master hostnames

When replacing dead masters, use DNS aliases to prevent long maintenance windows. If the cluster was set up without aliases, change the hostnames as described in this section.

Best practices when adding new tablet servers

A common workflow when administering a Kudu cluster is adding additional tablet server instances, in an effort to increase storage capacity, decrease load or utilization on individual hosts, increase compute power, and more.

Monitoring cluster health with ksck

The kudu
CLI includes a tool called ksck
that can be used for gathering information about the state of a Kudu cluster, including checking its health. ksck
will identify issues such as under-replicated tablets, unreachable tablet servers, or tablets without a leader.

Orchestrating a rolling restart with no downtime

Kudu 1.12 provides tooling to restart a cluster with no downtime. This topic provides the steps to perform a rolling restart.

Changing directory configuration

For higher read parallelism and larger volumes of storage per server, you may want to configure servers to store data in multiple directories on different devices. You can add or remove data directories on an existing master or tablet server by updating the --fs_data_dirs
gflag configuration and restarting the server. Data is striped across data directories, and when a new data directory is added, new data will be striped across the union of the old and new directories.

Recovering from disk failure

Kudu nodes can only survive failures of disks on which certain Kudu directories are mounted. For more information about the different Kudu directory types, see the Directory configuration topic.

Recovering from full disks

By default, Kudu reserves a small amount of space, 1% by capacity, in its directories. Kudu considers a disk full if there is less free space available than the reservation. Kudu nodes can only tolerate running out of space on disks on which certain Kudu directories are mounted.

Bringing a tablet that has lost a majority of replicas back online

If a tablet has permanently lost a majority of its replicas, it cannot recover automatically and operator intervention is required. If the tablet servers hosting a majority of the replicas are down (i.e. ones reported as "TS unavailable" by ksck
), they should be recovered instead if possible.

Rebuilding a Kudu filesystem layout

In the event that critical files are lost, i.e. WALs or tablet-specific metadata, all Kudu directories on the server must be deleted and rebuilt to ensure correctness. Doing so will destroy the copy of the data for each tablet replica hosted on the local server. Kudu will automatically re-replicate tablet replicas removed in this way, provided the replication factor is at least three and all other servers are online and healthy.

Physical backups of an entire node

Kudu does not yet provide any built-in backup and restore functionality. However, it is possible to create a physical backup of a Kudu node, either tablet server or master, and restore it later.

Scaling storage on Kudu master and tablet servers in the cloud

If you find that the size of your Kudu cloud deployment has exceeded previous expectations, or you simply wish to allocate more storage to Kudu, use the following set of high-level steps as a guide to increasing storage on your Kudu master or tablet server hosts. You must work with your cluster's Hadoop administrators and system administrators to complete this process. Replace the file paths in the following steps with those relevant to your setup.
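For the rolling-restart workflow, the quiescing support added in Kudu 1.12 can be driven from the CLI. A per-server sketch (the tablet server address is hypothetical) might look like:

```shell
# Quiesce the tablet server so it relinquishes tablet leaderships and
# stops accepting new scans (hypothetical address).
kudu tserver quiesce start tserver-1.example.com:7050

# Re-run with this flag to fail unless the server is fully quiesced;
# loop on it before restarting the process.
kudu tserver quiesce start tserver-1.example.com:7050 --error_if_not_fully_quiesced

# After restarting the process, exit quiescing so the server resumes
# normal operation, then move on to the next server.
kudu tserver quiesce stop tserver-1.example.com:7050
```

Repeating this one tablet server at a time keeps a majority of each tablet's replicas available throughout the restart.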
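The directory-configuration change described above can be sketched with the kudu fs update_dirs tool; the paths here are hypothetical, and the server must be stopped before the tool is run:

```shell
# With the server stopped, rewrite its directory set to include a new
# data directory (/data/3). The WAL directory and the existing data
# directories must be listed exactly as currently configured.
kudu fs update_dirs --fs_wal_dir=/wals --fs_data_dirs=/data/1,/data/2,/data/3

# Then update --fs_data_dirs in the server's gflag file to match and
# restart the server; new data is striped across all three directories.
```

Removing a directory works the same way, by listing the reduced directory set.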
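For the lost-majority workflow, one hedged sketch uses the kudu remote_replica unsafe_change_config tool to force a new Raft configuration consisting only of a surviving replica; the address is hypothetical, and the tablet ID and UUID placeholders are left for the operator to fill in:

```shell
# First locate the tablet's surviving replica and the UUID of the
# tablet server hosting it (hypothetical master address).
kudu cluster ksck master-1.example.com:7051

# Force the surviving replica to adopt a configuration containing only
# itself. This can lose recent writes; use it only when a majority of
# replicas is permanently gone.
kudu remote_replica unsafe_change_config tserver-1.example.com:7050 \
    <tablet_id> <surviving_tserver_uuid>
```

Once the single-replica configuration is in place, Kudu re-replicates the tablet back up to its replication factor on healthy servers.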