What's New in Apache Kudu
This topic lists new features for Apache Kudu in this release of Cloudera Runtime.
Support for putting tablet servers in to maintenance mode
Kudu supports putting tablet servers into maintenance. In this mode, the tablet server's
replicas are not re-replicated if they fail. Re-replication for the remaining, under-replicated
tablets is triggered only when you exit from the maintenance mode. The kudu tserver
state enter_maintenance
and the kudu tserver state exit_maintenance
tools have been added to orchestrate the tablet server maintenance. The kudu tserver
list
tool has been amended with a "state" column option to display the current state
of each tablet server.
Built-in NTP client maintains internal time
Kudu has a built-in NTP client which maintains an internal time that is used to generate the
HybridTime timestamps. If you enable the NTP client, the system clock synchronization for the
nodes running Kudu is no longer necessary. This is useful for containerized deployments and in
other cases when it is troublesome to maintain a properly configured system NTP service at each
node of a Kudu cluster. The list of NTP servers to synchronize against is specified with the
--builtin_ntp_servers
flag. By default, the Kudu masters and the tablet
servers use public servers hosted by the NTP Pool project. To use the built-in NTP client, set
the --time_source=builtin
flag and reconfigure the
--builtin_ntp_servers
flag if necessary.
Support for aggregated table statistics for the Kudu clients
Aggregated table statistics are now available to the Kudu clients through the
KuduClient.getTableStatistics()
and the
KuduTable.getTableStatistics()
methods in the Kudu Java client, and by using
the KuduClient.GetTableStatistics()
method in the Kudu C++ client. This allows
for various query optimizations.
For example, Spark now uses the aggregrated table statistics to perform join optimizations.
The statistics are available via the API of both the C++ and the Java Kudu clients. In addition,
per-table statistics are available via the kudu table statistics
CLI tool. The
statistics are also available via the master's Web UI at the
master:8051/metrics
and the master:8051/table?id=<uuid>
URIs.
New operations using the Kudu CLI
- Altering table columns: The following, newly introduced sub-commands allow you to
alter a column of the specified table:
kudu table column_set_default
kudu table column_remove_default
kudu table column_set_compression
kudu table column_set_encoding
kudu table column_set_block_size
- Dropping table columns: The
kudu table delete_column
sub-command allows you to drop a column of the specified table. - Getting and setting extra configuration properties: The
kudu table get_extra_configs
and thekudu table set_extra_config
sub-commands allow you to get and set the extra extra configuration properties for a table respectively. - Creating and dropping range partitions: The
kudu table add_range_partition
and thekudu table drop_range_partition
sub-commands allow you to create and drop the range partitions for a table respectively.
Optimizations and improvements
- Tablet servers now expand a tablet's data directory group with available healthy directories when all directories of the group are full.
- For scan operations that are run with the
CLOSEST_REPLICA
selection mode, the Kudu Java client now picks a random available replica in case no replica is located at the same node with the client that initiated the scan operation. This helps to spread the load generated by multiple scan requests to the same tablet among all available replicas. In older versions of Kudu, all such scan requests would end up fetching data from the same tablet replica. - The tablet servers now consider the available disk space when choosing a set of data directories for a tablet's data directory group, and when deciding in which data directory a new block should be written.
- The tablet servers reject any individual write operations that violate the schema constraints in a batch of write operations that are received from a client. The previous behavior was to reject the whole batch of write operations if a violation of the schema constraints is detected even for a single row.
- Kudu RPC now enables TCP keepalive for all outbound connections for faster detection of the no-longer-reachable nodes.
- The memory reserved by tcmalloc is now released to the OS periodically to avoid any potential OOM issues in case of read-only workloads.
- The evaluation of predicates on columns of primitive types and
NULL
orNOT NULL
predicates has been optimized to leverage SIMD instructions.