Condition
When attempting to downgrade or resize executor nodes in a Cloudera on cloud cluster, if any
node hosts the Kudu tablet server role, you might encounter the following error:
'KUDU_TSERVER' service is not enabled for scaling down
This error typically occurs during node decommissioning due to the presence
of the tablet server role on one or more nodes.
Cause
Decommissioning a node hosting a Kudu tablet server requires careful tablet
migration. Directly removing or resizing such a node is not possible due to
hosted tablet replicas. To safely remove a tablet server, it must first be
placed in maintenance mode, followed by a tablet rebalancing operation to move
its replicas.
Removing a tablet server without first moving its replicas can result in data
loss, cluster inconsistency, and errors during scale-down operations, because
Kudu prevents the removal of nodes with active tablet servers. You must manually
move tablets off the server scheduled for decommissioning.
Solution
-
Perform Kerberos authentication if applicable.
If your cluster uses Kerberos, run kinit with Kudu's keytab before executing any
Kudu CLI commands. You do not need to use sudo.
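As a minimal sketch of this step, the helper below wraps kinit. The function name kudu_kinit is hypothetical, and the keytab path and principal are placeholders you must supply, because their actual values depend on your deployment.

```shell
# Hypothetical helper; the name and arguments are illustrative only.
# $1: path to the Kudu keytab (placeholder), $2: Kudu principal (placeholder).
kudu_kinit() {
  local keytab="$1"
  local principal="$2"

  # Fail early if the current user cannot read the keytab.
  if [ ! -r "$keytab" ]; then
    echo "keytab not readable: $keytab" >&2
    return 1
  fi

  # Obtain a Kerberos ticket from the keytab; sudo is not required.
  kinit -kt "$keytab" "$principal"
}
```

A call would look like kudu_kinit <keytab_path> <kudu_principal>, followed by klist to confirm that a ticket was issued.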
-
Perform prechecks before scale-down actions.
Before initiating any scale-down actions, the following preconditions
must be met:
- The node being decommissioned cannot run a KUDU_MASTER role.
- Before decommissioning a Kudu tablet server, perform a Kudu cluster health
check, for example, by confirming that the kudu cluster ksck command
returns a success exit code.
- As an implementation constraint, Kudu clusters can only be
scaled down by decommissioning one tablet server at a time. If
multiple nodes need to be removed, they must be decommissioned
one at a time.
- The decommissioned tablet server cannot host any non-replicated (RF=1)
tablets. To avoid this situation, Cloudera recommends enforcing table
creation with at least RF=3 by customizing the min_num_replicas master flag.
- If the tablet server contains any replicas of tables with a replication
factor of 1, these replicas must be manually moved off the tablet server
before the tablet server is shut down. This can be achieved using the
kudu tablet change_config move_replica tool.
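The health check and the RF=1 replica move described in these prechecks could be scripted roughly as follows. This is a sketch under stated assumptions: the function names are hypothetical, and the master addresses, tablet ID, and tablet server UUIDs are placeholders for values from your own cluster.

```shell
# Hypothetical helpers; names and arguments are illustrative only.

# Returns 0 only when "kudu cluster ksck" exits successfully.
# $1: comma-separated Kudu master addresses (placeholder).
precheck_kudu_health() {
  local masters="$1"
  if kudu cluster ksck "$masters" >/dev/null; then
    echo "ksck passed: safe to continue"
  else
    echo "ksck failed: resolve health issues before decommissioning" >&2
    return 1
  fi
}

# Moves one replica of an RF=1 tablet to another tablet server.
# $2: tablet ID, $3: UUID of the server being removed, $4: UUID of the target.
move_rf1_replica() {
  local masters="$1" tablet_id="$2" from_uuid="$3" to_uuid="$4"
  kudu tablet change_config move_replica \
    "$masters" "$tablet_id" "$from_uuid" "$to_uuid"
}
```

Gating on the exit code rather than parsing the ksck output keeps the check simple. Run move_rf1_replica once for each RF=1 tablet hosted on the server.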
-
Decommission the tablet server.
- Put the tablet server into maintenance mode by using the
kudu tserver state enter_maintenance tool.
- Run the kudu cluster rebalance tool, supplying the --ignored_tservers
argument with the UUIDs of the tablet servers to be decommissioned and the
--move_replicas_from_ignored_tservers flag.
- Wait for the moves to complete and for ksck to show the cluster in a
healthy state.
- Ensure that ksck reports a healthy cluster and no tablets on this tablet
server.
- Once the above steps are successfully completed, the
decommissioned tablet server can be brought offline.
- After the tablet server and the decommissioned node have been
stopped, unregister the tablet server from the Kudu cluster by
running the following command:
kudu tserver unregister <master_addresses> <tserver_uuid>
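The decommissioning sequence above can be sketched as two shell functions. The names drain_tserver and unregister_tserver are hypothetical, and the master addresses and tablet server UUID are placeholders. The sketch handles one tablet server at a time, matching the constraint stated in the prechecks.

```shell
# Hypothetical helpers; names and arguments are illustrative only.
# $1: comma-separated Kudu master addresses, $2: UUID of the tablet server.
drain_tserver() {
  local masters="$1"
  local uuid="$2"

  # Step 1: put the tablet server into maintenance mode.
  kudu tserver state enter_maintenance "$masters" "$uuid" || return 1

  # Step 2: move all replicas off the tablet server being removed.
  kudu cluster rebalance "$masters" \
    --ignored_tservers="$uuid" \
    --move_replicas_from_ignored_tservers || return 1

  # Step 3: require a healthy cluster before the node is taken offline.
  if kudu cluster ksck "$masters" >/dev/null; then
    echo "ksck passed; confirm that no tablets remain on $uuid"
  else
    echo "ksck failed after rebalancing; do not stop $uuid yet" >&2
    return 1
  fi
}

# Run only after the tablet server and its node have been stopped.
unregister_tserver() {
  kudu tserver unregister "$1" "$2"
}
```

Each step returns early on failure so that a failed rebalance never leads to the node being stopped. The check that no tablets remain on the server is left as a manual step in this sketch.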
-
Perform a post-scale-down health check.
Run the kudu cluster ksck tool to confirm the Kudu cluster health. If the
cluster is healthy, report a success. If the cluster is not healthy, report on
the issue. For more information, see Check the health of a Kudu cluster.
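One possible way to script the final check is shown below. The function name post_scale_down_check is hypothetical and the master addresses are a placeholder. The sketch captures the ksck output so that it can be included in the report when the check fails.

```shell
# Hypothetical helper; the name and argument are illustrative only.
# $1: comma-separated Kudu master addresses (placeholder).
post_scale_down_check() {
  local masters="$1"
  local report

  # Capture both stdout and stderr so the details survive for reporting.
  if report=$(kudu cluster ksck "$masters" 2>&1); then
    echo "SUCCESS: Kudu cluster is healthy"
  else
    echo "FAILURE: ksck reported problems:" >&2
    printf '%s\n' "$report" >&2
    return 1
  fi
}
```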