Planning for Apache Kudu

The perfect schema

The perfect schema would accomplish the following:

Data would be distributed such that reads and writes are spread evenly across tablet servers. This can be achieved by effective partitioning.
Tablets would grow at an even, predictable rate, and load across tablets would remain steady over time. This can be achieved by effective partitioning.
Scans would read the minimum amount of data necessary to fulfill a query. This is impacted mostly by primary key design, but partitioning also plays a role via partition pruning.

The perfect schema depends on the characteristics of your data, what you need to do with it, and the topology of your cluster. Schema design is the single most important thing within your control to maximize the performance of your Kudu cluster.

We want your opinion

How can we improve this page?

What kind of feedback do you have?