Partitioning tables
Tables are partitioned into tablets according to a partition schema on the primary key columns. Each tablet is served by at least one tablet server. Ideally, a table should be split into tablets that are distributed across a number of tablet servers to maximize parallel operations. The details of the partitioning schema you use will depend entirely on the type of data you store and how you access it.
Kudu currently has no mechanism for splitting or merging tablets after the table has been created. Until this feature has been implemented, you must provide a partition schema for your table when you create it. When designing your tables, consider using primary keys that will allow you to partition your table into tablets which grow at similar rates.
You can partition your table using Impala's PARTITION BY
clause,
which supports distribution by RANGE
or HASH
. The
partition scheme can contain zero or more HASH
definitions, followed
by an optional RANGE
definition. The RANGE
definition can refer to one or more primary key columns. Examples of basic and
advanced partitioning are shown below.
Monotonically Increasing Values - If you partition by range on a column whose
values are monotonically increasing, the last tablet will grow much larger than the
others. Additionally, all data being inserted will be written to a single tablet at a
time, limiting the scalability of data ingest. In that case, consider distributing by
HASH
instead of, or in addition to, RANGE
.
In general, be mindful the number of tablets limits the parallelism of reads, in the current implementation. Increasing the number of tablets significantly beyond the number of cores is likely to have diminishing returns.