Error during Kudu Tablet Server node downgrade or resize in Cloudera on cloud cluster

Condition

When attempting to downgrade or resize executor nodes in a Cloudera on cloud cluster, if any node hosts the Kudu tablet server role, you might encounter the following error:

'KUDU_TSERVER' service is not enabled for scaling down

This error typically occurs during node decommissioning due to the presence of the tablet server role on one or more nodes.

Cause

Decommissioning a node hosting a Kudu tablet server requires careful tablet migration. Directly removing or resizing such a node is not possible due to hosted tablet replicas. To safely remove a tablet server, it must first be placed in maintenance mode, followed by a tablet rebalancing operation to move its replicas.

Removing a tablet server without first moving its replicas can result in data loss and cluster inconsistency, so Kudu blocks scale-down operations on nodes that run an active tablet server. You must manually move the tablets off the server scheduled for decommissioning.

Solution

  1. Perform Kerberos authentication if applicable.
    If your cluster uses Kerberos, run kinit with Kudu's keytab before executing any Kudu CLI commands. You do not need to use sudo.
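
As a minimal sketch of this step, the helper below wraps kinit. The function name kudu_kinit, the keytab path, and the principal are hypothetical placeholders rather than values from this article; substitute the ones that apply to your cluster.

```shell
# Hypothetical helper: obtain a Kerberos ticket from a keytab.
# $1 = path to the Kudu keytab, $2 = Kerberos principal stored in that keytab.
kudu_kinit() {
  if [ "$#" -ne 2 ]; then
    echo "usage: kudu_kinit <keytab_path> <principal>" >&2
    return 2
  fi
  # -k: authenticate with a keytab instead of a password prompt.
  # -t: the keytab file to read.
  kinit -kt "$1" "$2"
}

# Example call (placeholder values, shown as a comment so nothing runs on load):
#   kudu_kinit /path/to/kudu.keytab kudu/host.example.com@EXAMPLE.COM
```

After a successful call, klist should show the ticket that subsequent Kudu CLI commands will use.
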
  2. Perform prechecks before scale-down actions.
    Before initiating any scale-down actions, the following preconditions must be met:
    • The node being decommissioned cannot run a KUDU_MASTER role.
    • Before decommissioning a Kudu tablet server, check the health of the Kudu cluster, for example, by confirming that the kudu cluster ksck command returns a success exit code.
    • As an implementation constraint, Kudu clusters can only be scaled down by decommissioning one tablet server at a time. If multiple nodes need to be removed, they must be decommissioned one at a time.
    • The tablet server being decommissioned cannot host any non-replicated (RF=1) tablets. To avoid this situation, Cloudera recommends enforcing table creation with at least RF=3 by customizing the min_num_replicas master flag.
      • If the tablet server contains any replicas of tables with a replication factor of 1, these replicas must be manually moved off the tablet server before the tablet server is shut down. This can be achieved using the kudu tablet change_config move_replica tool.
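
The two CLI-based prechecks above can be sketched as small shell helpers. The function names are hypothetical; the master addresses, tablet ID, and tablet server UUIDs are values you supply. The sketch assumes that ksck exits with a non-zero status when the cluster is unhealthy, as the health precondition above implies.

```shell
# Returns 0 only if ksck reports a healthy cluster.
# $1 = comma-separated Kudu master addresses.
kudu_cluster_is_healthy() {
  # Only the exit code matters here, so the report itself is discarded.
  kudu cluster ksck "$1" >/dev/null 2>&1
}

# Moves one tablet replica from one tablet server to another.
# $1 = master addresses, $2 = tablet ID,
# $3 = UUID of the tablet server to be decommissioned,
# $4 = UUID of the tablet server that receives the replica.
move_replica_off_tserver() {
  kudu tablet change_config move_replica "$1" "$2" "$3" "$4"
}

# Example calls (placeholder values, shown as comments so nothing runs on load):
#   kudu_cluster_is_healthy master1:7051,master2:7051,master3:7051 || echo "fix cluster health first"
#   move_replica_off_tserver master1:7051 <tablet_id> <from_ts_uuid> <to_ts_uuid>
```

Run the move helper once for each RF=1 tablet that the tablet server hosts.
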
  3. Decommission the tablet server.
    1. Put the tablet server into maintenance mode by using the kudu tserver state enter_maintenance tool.
    2. Run the kudu cluster rebalance tool, supplying the --ignored_tservers argument with the UUIDs of the tablet servers to be decommissioned and the --move_replicas_from_ignored_tservers flag.
    3. Wait for the moves to complete and for ksck to show the cluster in a healthy state.
    4. Ensure that ksck reports a healthy cluster and no tablets on this tablet server.
    5. Once the above steps are successfully completed, the decommissioned tablet server can be brought offline.
    6. After the tablet server and the decommissioned node have been stopped, unregister the tablet server from the Kudu cluster by running the following command:
      kudu tserver unregister <master_addresses> <tserver_uuid>
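
The decommissioning sequence above can be sketched as follows for a single tablet server. The function names, the retry count, and the sleep interval are hypothetical. The sketch does not verify that the tablet server holds no tablets; confirm that from the ksck report before you stop the server.

```shell
# Polls ksck until it reports a healthy cluster or the attempts run out.
# $1 = master addresses, $2 = maximum attempts, $3 = seconds between attempts.
wait_for_healthy() {
  attempts="$2"
  while [ "$attempts" -gt 0 ]; do
    if kudu cluster ksck "$1" >/dev/null 2>&1; then
      return 0
    fi
    attempts=$((attempts - 1))
    sleep "$3"
  done
  return 1
}

# Steps 1 to 3: enter maintenance mode, move replicas away, wait for health.
# $1 = master addresses, $2 = UUID of the tablet server to decommission.
drain_tserver() {
  kudu tserver state enter_maintenance "$1" "$2" || return 1
  kudu cluster rebalance "$1" \
    --ignored_tservers="$2" \
    --move_replicas_from_ignored_tservers || return 1
  # Hypothetical limits: up to 30 checks, 10 seconds apart.
  wait_for_healthy "$1" 30 10
}

# Step 6: run only after the tablet server and its node have been stopped.
unregister_tserver() {
  kudu tserver unregister "$1" "$2"
}

# Example calls (placeholder values, shown as comments so nothing runs on load):
#   drain_tserver master1:7051,master2:7051,master3:7051 <tserver_uuid>
#   unregister_tserver master1:7051,master2:7051,master3:7051 <tserver_uuid>
```

Keeping the drain and unregister steps in separate functions leaves room for the manual checks and the server shutdown that must happen between them.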
      
  4. Perform a post-scale-down health check.
    Run the kudu cluster ksck tool to confirm the Kudu cluster health. If the cluster is healthy, report a success. If the cluster is not healthy, report on the issue. For more information, see Check the health of a Kudu cluster.
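
A minimal sketch of the post-scale-down check, assuming that ksck signals an unhealthy cluster through a non-zero exit code. The function name and the message wording are hypothetical.

```shell
# Runs ksck and reports the outcome.
# $1 = comma-separated Kudu master addresses.
report_cluster_health() {
  if kudu cluster ksck "$1"; then
    echo "Kudu cluster is healthy"
  else
    # In the else branch, $? still holds the exit status of the ksck call.
    rc=$?
    echo "Kudu cluster is NOT healthy (ksck exit code $rc)" >&2
    return "$rc"
  fi
}

# Example call (placeholder value, shown as a comment so nothing runs on load):
#   report_cluster_health master1:7051,master2:7051,master3:7051
```

The ksck report is left on standard output so that any unhealthy tablets it lists can be investigated directly.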