What's New in Apache Kudu

This topic lists new features for Apache Kudu in this release of Cloudera Runtime.

Support for putting tablet servers into maintenance mode

Kudu supports putting tablet servers into maintenance mode. While a tablet server is in this mode, its replicas are not re-replicated if the server fails. Re-replication of any remaining under-replicated tablets is triggered only when the tablet server exits maintenance mode. The kudu tserver state enter_maintenance and kudu tserver state exit_maintenance tools have been added to orchestrate tablet server maintenance. The kudu tserver list tool now accepts a "state" column option that displays the current state of each tablet server.
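The maintenance workflow above can be sketched as a small dry-run script. This is a minimal sketch rather than an official procedure: the master addresses and the tablet server UUID are hypothetical placeholders, and the positional argument order (master addresses, then tablet server UUID) and the --columns flag are assumptions based on the Kudu CLI usage text, so confirm them with kudu tserver state --help before relying on them. The script only prints the commands; it does not run them.

```python
import shlex

# Hypothetical placeholders: replace with values from your own cluster.
MASTERS = "master-1:7051,master-2:7051,master-3:7051"
TSERVER_UUID = "0123456789abcdef0123456789abcdef"


def maintenance_commands(masters, tserver_uuid):
    """Return the CLI invocations for one maintenance window, in order.

    The argument order is an assumption; check the tool's --help output.
    """
    return [
        # 1. Stop re-replication of this server's replicas if it goes down.
        ["kudu", "tserver", "state", "enter_maintenance", masters, tserver_uuid],
        # 2. Confirm the state column shows the server in maintenance mode.
        ["kudu", "tserver", "list", masters, "--columns=uuid,state"],
        # 3. After the work is done, allow re-replication again.
        ["kudu", "tserver", "state", "exit_maintenance", masters, tserver_uuid],
    ]


for cmd in maintenance_commands(MASTERS, TSERVER_UUID):
    # Dry run: print a shell-safe command line instead of executing it.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Printing the commands first makes it easy to review the sequence before running it against a live cluster.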

Built-in NTP client maintains internal time

Kudu now has a built-in NTP client that maintains an internal time, which is used to generate the HybridTime timestamps. If you enable the built-in NTP client, you no longer need to synchronize the system clocks of the nodes running Kudu. This is useful for containerized deployments and in other cases where it is difficult to maintain a properly configured system NTP service on each node of a Kudu cluster. The list of NTP servers to synchronize against is specified with the --builtin_ntp_servers flag. By default, the Kudu masters and the tablet servers use public servers hosted by the NTP Pool project. To use the built-in NTP client, set the --time_source=builtin flag and, if necessary, reconfigure the --builtin_ntp_servers flag.
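As an illustration of the two flags named above, the following minimal sketch composes the flag lines that could be added to the configuration of a master or tablet server. The server names are hypothetical, and the comma-separated value format for --builtin_ntp_servers is an assumption to verify against the flag's documentation.

```python
def builtin_ntp_flags(ntp_servers=None):
    """Return flag lines that switch a Kudu daemon to the built-in NTP client.

    If ntp_servers is empty or None, only --time_source is emitted and the
    daemon keeps its default server list.
    """
    flags = ["--time_source=builtin"]
    if ntp_servers:
        # Assumption: the flag takes a comma-separated list of servers.
        flags.append("--builtin_ntp_servers=" + ",".join(ntp_servers))
    return flags


# Hypothetical internal NTP servers.
print("\n".join(builtin_ntp_flags(["ntp1.example.com", "ntp2.example.com"])))
```

Calling builtin_ntp_flags() with no arguments yields only the --time_source=builtin line, which corresponds to keeping the default NTP Pool servers.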

Support for aggregated table statistics for the Kudu clients

Aggregated table statistics are now available to the Kudu clients through the KuduClient.getTableStatistics() and the KuduTable.getTableStatistics() methods in the Kudu Java client, and by using the KuduClient.GetTableStatistics() method in the Kudu C++ client. This allows for various query optimizations.

For example, Spark now uses the aggregated table statistics to perform join optimizations. In addition, per-table statistics are available via the kudu table statistics CLI tool and through the master's Web UI at the master:8051/metrics and master:8051/table?id=<uuid> URIs.
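To make the idea of statistics-driven join optimization concrete, here is a minimal sketch of how a query planner might use table size to pick a join strategy. It is not Spark's or Kudu's actual logic: the TableStats class, its field names, and the threshold value are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class TableStats:
    """Hypothetical container for aggregated table statistics."""
    on_disk_size: int    # bytes
    live_row_count: int  # rows


# Hypothetical cutoff: tables at or below this size are cheap to broadcast.
BROADCAST_THRESHOLD_BYTES = 10 * 1024 * 1024


def choose_join_strategy(left, right, threshold=BROADCAST_THRESHOLD_BYTES):
    """Pick 'broadcast' when the smaller table fits under the threshold."""
    smaller = min(left.on_disk_size, right.on_disk_size)
    return "broadcast" if smaller <= threshold else "shuffle"


dim = TableStats(on_disk_size=5 * 1024 * 1024, live_row_count=20_000)
fact = TableStats(on_disk_size=500 * 1024 ** 3, live_row_count=4_000_000_000)
print(choose_join_strategy(fact, dim))
```

Without statistics, a planner has to guess table sizes; with them, it can avoid shuffling a large table when the other side is small enough to send to every worker.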

New operations using the Kudu CLI

The Kudu CLI supports the following new operations:
  • Altering table columns: The following newly introduced sub-commands allow you to alter a column of the specified table:
    • kudu table column_set_default
    • kudu table column_remove_default
    • kudu table column_set_compression
    • kudu table column_set_encoding
    • kudu table column_set_block_size
  • Dropping table columns: The kudu table delete_column sub-command allows you to drop a column of the specified table.
  • Getting and setting extra configuration properties: The kudu table get_extra_configs and the kudu table set_extra_config sub-commands allow you to get and set the extra configuration properties for a table, respectively.
  • Creating and dropping range partitions: The kudu table add_range_partition and the kudu table drop_range_partition sub-commands allow you to create and drop the range partitions for a table respectively.
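The sub-commands listed above might be invoked as in the following dry-run sketch. The master address, table name, and column names are hypothetical. The positional argument order, and the use of JSON arrays for default values and range bounds, are assumptions based on the CLI usage text; check kudu table <sub-command> --help for the authoritative syntax. The script only prints the commands.

```python
import json
import shlex

# Hypothetical cluster and table names.
MASTERS = "master-1:7051"
TABLE = "metrics"


def kudu_table(subcommand, *args):
    """Build a 'kudu table <subcommand>' argument list (assumed arg order)."""
    return ["kudu", "table", subcommand, MASTERS, TABLE, *args]


commands = [
    # Set, then remove, a default value on a hypothetical 'status' column.
    kudu_table("column_set_default", "status", json.dumps(["unknown"])),
    kudu_table("column_remove_default", "status"),
    # Drop a hypothetical column.
    kudu_table("delete_column", "obsolete_col"),
    # Read the table's extra configuration properties.
    kudu_table("get_extra_configs"),
    # Add and drop a range partition with bounds [0] and [100].
    kudu_table("add_range_partition", json.dumps([0]), json.dumps([100])),
    kudu_table("drop_range_partition", json.dumps([0]), json.dumps([100])),
]

for cmd in commands:
    # Dry run: print a shell-safe command line instead of executing it.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```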

Optimizations and improvements

  • Tablet servers now expand a tablet's data directory group with available healthy directories when all directories of the group are full.
  • For scan operations that are run with the CLOSEST_REPLICA selection mode, the Kudu Java client now picks a random available replica if no replica is located on the same node as the client that initiated the scan operation. This helps to spread the load generated by multiple scan requests to the same tablet among all available replicas. In older versions of Kudu, all such scan requests would end up fetching data from the same tablet replica.
  • The tablet servers now consider the available disk space when choosing a set of data directories for a tablet's data directory group, and when deciding in which data directory a new block should be written.
  • The tablet servers now reject only the individual write operations that violate the schema constraints in a batch of write operations received from a client. Previously, the whole batch of write operations was rejected if a violation of the schema constraints was detected for even a single row.
  • Kudu RPC now enables TCP keepalive for all outbound connections for faster detection of nodes that are no longer reachable.
  • The memory reserved by tcmalloc is now released to the OS periodically to avoid any potential OOM issues in case of read-only workloads.
  • The evaluation of predicates on columns of primitive types and NULL or NOT NULL predicates has been optimized to leverage SIMD instructions.
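The CLOSEST_REPLICA behavior described in the list above can be modeled with a short sketch. This is an illustrative model under stated assumptions, not the Kudu Java client's implementation: the Replica class and pick_replica function are hypothetical, and locality is reduced to a simple host name comparison.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    """Hypothetical stand-in for a tablet replica location."""
    host: str


def pick_replica(replicas, client_host, rng=random):
    """Prefer a replica on the client's node; otherwise pick one at random."""
    if not replicas:
        raise ValueError("no available replicas")
    local = [r for r in replicas if r.host == client_host]
    if local:
        return local[0]
    # No local replica: a random choice spreads scans across all replicas
    # instead of always sending them to the same one.
    return rng.choice(replicas)


replicas = [Replica("ts-1"), Replica("ts-2"), Replica("ts-3")]
print(pick_replica(replicas, "ts-2").host)    # local replica is preferred
print(pick_replica(replicas, "client-9").host)  # random fallback
```

With a fixed fallback, every remote client would scan the same replica of a tablet; the random fallback distributes that load across the replicas.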