Apache Kudu Schema Design and Usage Limitations

Schema Design Limitations

Primary Key
  • The columns which make up the primary key must be listed first in the schema.

  • Columns with DOUBLE, FLOAT, or BOOL types are not allowed as part of a primary key definition. Additionally, all columns that are part of a primary key definition must be NOT NULL.

  • The primary key of a row cannot be modified using the UPDATE functionality. To modify a row’s primary key, the row must be deleted and re-inserted with the modified key. Such a modification is non-atomic.

  • Columns that are part of the primary key cannot be renamed. The primary key may not be changed after the table is created. You must drop and recreate a table to select a new primary key or rename key columns.

  • Auto-generated primary keys are not supported.

  • Cells making up a composite primary key are limited to a total of 16KB after internal composite-key encoding is done by Kudu.
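
The ordering, type, and nullability rules above can be checked before a table is created. The following is a minimal sketch in Python and is not part of any Kudu client API: the Column class and the check_primary_key function are hypothetical names introduced here for illustration, and the 16KB encoded-key limit is not covered because it depends on Kudu's internal key encoding.

```python
from dataclasses import dataclass
from typing import List

# Types that this section says cannot be part of a primary key.
FORBIDDEN_KEY_TYPES = {"DOUBLE", "FLOAT", "BOOL"}


@dataclass
class Column:
    """Hypothetical column description; not a Kudu client class."""
    name: str
    type: str
    nullable: bool = False
    is_key: bool = False


def check_primary_key(columns: List[Column]) -> List[str]:
    """Return a list of rule violations; an empty list means none were found."""
    problems = []
    seen_non_key = False
    for col in columns:
        if not col.is_key:
            seen_non_key = True
            continue
        # A key column that follows a non-key column breaks the ordering rule.
        if seen_non_key:
            problems.append(f"{col.name}: key columns must be listed first")
        if col.type.upper() in FORBIDDEN_KEY_TYPES:
            problems.append(f"{col.name}: {col.type} is not allowed in a primary key")
        if col.nullable:
            problems.append(f"{col.name}: key columns must be NOT NULL")
    return problems
```

For example, check_primary_key([Column("id", "INT64", is_key=True), Column("v", "DOUBLE", nullable=True)]) returns an empty list, because the DOUBLE column is not part of the key.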

Columns
  • By default, Kudu will not permit the creation of tables with more than 300 columns. We recommend schema designs that use fewer columns for best performance.

  • TIMESTAMP, DECIMAL, CHAR, VARCHAR, DATE, and complex types such as ARRAY are not supported.

  • Type, nullability, compression, and encoding of existing columns cannot be changed by altering the table.

Tables
  • Tables must have an odd number of replicas, with a maximum of 7.

  • Replication factor (set at table creation time) cannot be changed.
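
Because column types and the replication factor cannot be changed later, it can help to validate a planned table definition up front. The sketch below is illustrative only: check_table_definition is a hypothetical helper, the type names are plain strings taken from the Columns list above, and ARRAY stands in for complex types in general.

```python
from typing import List

DEFAULT_MAX_COLUMNS = 300  # default limit quoted in the Columns list
MAX_REPLICAS = 7           # maximum quoted in the Tables list
# Type names listed above as unsupported; extend this set as needed.
UNSUPPORTED_TYPES = {"TIMESTAMP", "DECIMAL", "CHAR", "VARCHAR", "DATE", "ARRAY"}


def check_table_definition(column_types: List[str], replicas: int) -> List[str]:
    """Return a list of problems with a planned table; empty means none found."""
    problems = []
    if len(column_types) > DEFAULT_MAX_COLUMNS:
        problems.append(
            f"{len(column_types)} columns exceeds the default limit of {DEFAULT_MAX_COLUMNS}"
        )
    # Report each unsupported type once, in a stable order.
    for type_name in sorted({t.upper() for t in column_types} & UNSUPPORTED_TYPES):
        problems.append(f"unsupported column type: {type_name}")
    if replicas < 1 or replicas > MAX_REPLICAS or replicas % 2 == 0:
        problems.append(
            f"replication factor must be odd and at most {MAX_REPLICAS}, got {replicas}"
        )
    return problems
```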

Cells

No individual cell may be larger than 64KB before encoding or compression. The cells making up a composite key are limited to a total of 16KB after the internal composite-key encoding done by Kudu. Inserting rows not conforming to these limitations will result in errors being returned to the client.

Rows

Kudu was primarily designed for analytic use cases. Although individual cells may be up to 64KB, and Kudu supports up to 300 columns, it is recommended that no single row be larger than a few hundred KB. You are likely to encounter issues if a single row contains multiple kilobytes of data.
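
A client-side size check can catch oversized cells before the server rejects the row. The following Python sketch makes two assumptions that are not stated in this document: "64KB" is read as 65,536 bytes, and only string and binary values are measured. The names cell_size and check_row are hypothetical.

```python
from typing import Dict, List, Tuple, Union

MAX_CELL_BYTES = 64 * 1024  # assumption: "64KB" means 65,536 bytes


def cell_size(value: Union[str, bytes]) -> int:
    """Size in bytes of a string or binary cell, before encoding or compression."""
    return len(value.encode("utf-8")) if isinstance(value, str) else len(value)


def check_row(row: Dict[str, Union[str, bytes]]) -> Tuple[int, List[str]]:
    """Return (total row size in bytes, names of cells over the per-cell limit)."""
    sizes = {name: cell_size(value) for name, value in row.items()}
    too_big = [name for name, size in sizes.items() if size > MAX_CELL_BYTES]
    return sum(sizes.values()), too_big
```

The total row size is returned so that the caller can compare it against whatever row-size budget the application chooses.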

Dropping Columns and Tables
  • Dropping a column does not immediately reclaim space. Compaction must run first.

  • There is no way to run compaction manually, but dropping the table will reclaim the space immediately.

Other Usage Limitations
  • Identifiers such as table and column names must be valid UTF-8 sequences and no longer than 256 bytes.

  • Secondary indexes are not supported.

  • Multi-row transactions are not supported.

  • Relational features, such as foreign keys, are not supported.

If you are using Apache Impala (incubating) to query Kudu tables, refer to the section on Impala Integration Limitations as well.
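
The identifier rule in the list above (valid UTF-8, no longer than 256 bytes) is simple to check on the client. This is a minimal sketch; is_valid_identifier is a hypothetical helper, and it measures length in bytes rather than characters, which is the stricter reading for multi-byte names.

```python
MAX_IDENTIFIER_BYTES = 256


def is_valid_identifier(name: bytes) -> bool:
    """True if name is a valid UTF-8 sequence of at most 256 bytes."""
    try:
        name.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return len(name) <= MAX_IDENTIFIER_BYTES
```

A name made of 128 two-byte characters is exactly 256 bytes and passes; adding one more such character pushes it over the limit.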

Partitioning Limitations

  • Tables must be manually pre-split into tablets using simple or compound primary keys. Automatic splitting is not yet possible. Kudu does not allow you to change how a table is partitioned after creation, with the exception of adding or dropping range partitions. See Apache Kudu Schema Design for more information.

  • Data in existing tables cannot currently be automatically repartitioned. As a workaround, create a new table with the new partitioning and insert the contents of the old table.

  • Tablets that lose a majority of replicas (such as 1 left out of 3) require manual intervention to be repaired.

Scaling Recommendations and Limitations

  • Recommended maximum number of tablet servers is 100.

  • Recommended maximum number of masters is 3.

  • Recommended maximum amount of stored data, post-replication and post-compression, per tablet server is 4TB.

  • Recommended maximum number of tablets per tablet server is 1000, post-replication.

  • Maximum number of tablets per table for each tablet server is 60, post-replication, at table-creation time.
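
The per-table limit at creation time can be turned into a quick sizing calculation. The sketch below assumes that tablet replicas are spread evenly across tablet servers, which this document does not guarantee; the function names are hypothetical.

```python
import math

MAX_TABLETS_PER_TABLE_PER_SERVER = 60  # post-replication, at table-creation time


def replicas_per_server(partitions: int, replication_factor: int, tablet_servers: int) -> int:
    """Tablet replicas of one table that land on each server, assuming an even spread."""
    return math.ceil(partitions * replication_factor / tablet_servers)


def max_partitions_at_creation(replication_factor: int, tablet_servers: int) -> int:
    """Largest partition count that stays within the per-server limit."""
    return (MAX_TABLETS_PER_TABLE_PER_SERVER * tablet_servers) // replication_factor
```

With 10 tablet servers and a replication factor of 3, max_partitions_at_creation(3, 10) gives 200 partitions, which is 600 replicas, or 60 per server.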

Server Management Limitations

  • Production deployments should configure at least 4GB of memory for tablet servers, and ideally more than 10GB.

  • Write ahead logs (WALs) can only be stored on one disk.

  • Disk failures are not tolerated; tablet servers will crash as soon as one is detected.

  • Failed disks with unrecoverable data require formatting of all Kudu data for that tablet server before it can be started again.

  • Data directories cannot be added/removed; they must be reformatted to change the set of directories.

  • Tablet servers cannot be gracefully decommissioned.

  • Tablet servers cannot change their address or port.

  • Kudu has a hard requirement on having an up-to-date NTP. Kudu masters and tablet servers will crash when out of sync.

  • Kudu releases have only been tested with NTP. Other time synchronization providers such as Chrony may not work.

Cluster Management Limitations

  • Rack awareness is not supported.

  • Multi-datacenter is not supported.

  • Rolling restart is not supported.

Replication and Backup Limitations

  • Kudu does not currently include any built-in features for backup and restore. Users are encouraged to use tools such as Spark or Impala to export or import tables as necessary.

Impala Integration Limitations

  • When creating a Kudu table, the CREATE TABLE statement must include the primary key columns before other columns, in primary key order.

  • Impala cannot update values in primary key columns.

  • Impala cannot create Kudu tables with TIMESTAMP, DECIMAL, VARCHAR, or nested-typed columns.

  • Kudu tables with a name containing upper case or non-ASCII characters must be assigned an alternate name when used as an external table in Impala.

  • Kudu tables with a column name containing upper case or non-ASCII characters may not be used as an external table in Impala. Non-primary key columns may be renamed in Kudu to work around this issue.

  • Kudu tables containing UNIXTIME_MICROS-typed columns may not be used as an external table in Impala.

  • NULL, NOT NULL, !=, and LIKE predicates are not pushed to Kudu, and instead will be evaluated by the Impala scan node. This may decrease performance relative to other types of predicates.

  • Updates, inserts, and deletes using Impala are non-transactional. If a query fails part of the way through, its partial effects will not be rolled back.

  • The maximum parallelism of a single query is limited to the number of tablets in a table. For good analytic performance, aim for 10 or more tablets per host or use large tables.

Impala Keywords Not Supported for Creating Kudu Tables

  • PARTITIONED
  • LOCATION
  • ROW FORMAT

Spark Integration Limitations

  • Kudu tables with a name containing upper case or non-ASCII characters must be assigned an alternate name when registered as a temporary table.

  • Kudu tables with a column name containing upper case or non-ASCII characters must not be used with SparkSQL. Columns can be renamed in Kudu to work around this issue.

  • <> and OR predicates are not pushed to Kudu, and instead will be evaluated by the Spark task. Only LIKE predicates with a suffix wildcard are pushed to Kudu. This means LIKE "FOO%" will be pushed, but LIKE "FOO%BAR" will not.

  • Kudu does not support all the types supported by Spark SQL. For example, Date, Decimal, and complex types are not supported on Kudu.

  • Kudu tables can only be registered as temporary tables in SparkSQL.

  • Kudu tables cannot be queried using HiveContext.
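
The LIKE pushdown rule in the list above can be approximated with a small pattern check. This is a sketch, not Kudu or Spark code: like_is_pushed_down is a hypothetical helper, it treats the SQL single-character wildcard "_" as disqualifying (an assumption), and it ignores escape characters.

```python
def like_is_pushed_down(pattern: str) -> bool:
    """True if the only wildcard in pattern is a single trailing '%'."""
    if not pattern.endswith("%"):
        return False
    prefix = pattern[:-1]
    # Any other wildcard in the prefix means the predicate is not a pure prefix match.
    return "%" not in prefix and "_" not in prefix
```

Under these assumptions, "FOO%" qualifies, while "FOO%BAR", "%FOO", and "FOO" do not.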

Security Limitations

  • Disk encryption is not built in. Kudu has been reported to run correctly on systems using local block device encryption (e.g. dmcrypt).

  • Authorization is only available at a system-wide, coarse-grained level. Kudu does not have the ability to restrict access based on the type of operation, or the target (table, column, row, etc). ACLs do not currently support authorization based on membership in a group.

  • Kudu does not support configuring a custom service principal for Kudu processes. The principal must follow the pattern kudu/<HOST>@<DEFAULT.REALM>.

  • Kudu does not support externally-issued certificates for internal wire encryption (server to server and client to server).

  • Kudu integration with Apache Flume does not support writing to Kudu clusters that require authentication or encryption.

  • Kudu client instances retrieve authentication tokens upon first contact with the cluster. However, these tokens expire after one week and Kudu clients do not automatically request fresh tokens after initial token expiration. Therefore, Kudu clients that are active for more than a week are not supported.

    Use of a single Kudu client instance for more than one week is only supported by the C++ client, not by the Java client.

    Note that applications such as Apache Impala (incubating) construct new clients for each query. Therefore, this limitation only affects the runtime of a single query.
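
A long-running application can track how long a client instance has existed and flag it before the one-week token lifetime is reached. The sketch below only compares timestamps; it does not call any Kudu API, client_outlived_token is a hypothetical name, and the creation time is assumed to be recorded by the application when it constructs the client.

```python
from datetime import datetime, timedelta

TOKEN_LIFETIME = timedelta(days=7)  # "one week", as stated above


def client_outlived_token(created_at: datetime, now: datetime) -> bool:
    """True if a client created at created_at has been alive for a week or more."""
    return now - created_at >= TOKEN_LIFETIME
```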

Other Known Issues

The following are known bugs and issues with the current release of Kudu. They will be addressed in later releases. Note that this list is not exhaustive, and is meant to communicate only the most important known issues.

Timeout Possible with Log Force Synchronization Option

If the Kudu master is configured with the -log_force_fsync_all option, tablet servers and clients will experience frequent timeouts, and the cluster may become unusable.

Longer Startup Times with a Large Number of Tablets

If a tablet server has a very large number of tablets, it may take several minutes to start up. It is recommended to limit the number of tablets per server to 100 or fewer. Consider this limitation when pre-splitting your tables. If you notice slow start-up times, you can monitor the number of tablets per server in the web UI.

Confusing Descriptions for Kudu TLS/SSL Settings in Cloudera Manager

Descriptions in the Cloudera Manager Admin Console for TLS/SSL settings are confusing, and will be replaced in a future release. The correct information about the settings is in the Usage Notes column:

Field | Usage Notes
Kerberos Principal | Set to the default principal, kudu.
Enable Secure Authentication And Encryption | Select this checkbox to enable authentication and RPC encryption between all Kudu clients and servers, as well as between individual servers. Only enable this property after you have configured Kerberos.
Master TLS/SSL Server Private Key File (PEM Format) | Set to the path containing the Kudu master host's private key (PEM-format). This is used to enable TLS/SSL encryption (over HTTPS) for browser-based connections to the Kudu master web UI.
Tablet Server TLS/SSL Server Private Key File (PEM Format) | Set to the path containing the Kudu tablet server host's private key (PEM-format). This is used to enable TLS/SSL encryption (over HTTPS) for browser-based connections to Kudu tablet server web UIs.
Master TLS/SSL Server Certificate File (PEM Format) | Set to the path containing the signed certificate (PEM-format) for the Kudu master host's private key (set in Master TLS/SSL Server Private Key File). The certificate file can be created by concatenating all the appropriate root and intermediate certificates required to verify trust.
Tablet Server TLS/SSL Server Certificate File (PEM Format) | Set to the path containing the signed certificate (PEM-format) for the Kudu tablet server host's private key (set in Tablet Server TLS/SSL Server Private Key File). The certificate file can be created by concatenating all the appropriate root and intermediate certificates required to verify trust.
Master TLS/SSL Server CA Certificate (PEM Format) | Disregard this field.
Tablet Server TLS/SSL Server CA Certificate (PEM Format) | Disregard this field.
Enable TLS/SSL for Master Server | Enables HTTPS encryption on the Kudu master web UI.
Enable TLS/SSL for Tablet Server | Enables HTTPS encryption on the Kudu tablet server web UIs.