Column encoding
Depending on the type of the column, Kudu columns can be created with the following encoding types.
- Plain Encoding: Data is stored in its natural format. For example, `int32` values are stored as fixed-size 32-bit little-endian integers.
- Bitshuffle Encoding: A block of values is rearranged to store the most significant bit of every value, followed by the second most significant bit of every value, and so on. Finally, the result is LZ4 compressed. Bitshuffle encoding is a good choice for columns that have many repeated values, or values that change by small amounts when sorted by primary key. The bitshuffle project has a good overview of performance and use cases.
- Run Length Encoding: Runs (consecutive repeated values) in a column are compressed by storing only the value and the count. Run length encoding is effective for columns with many consecutive repeated values when sorted by primary key.
- Dictionary Encoding: Dictionary encoding can be used for BINARY or STRING columns. A dictionary of unique values is built, and each column value is encoded as its corresponding index in the dictionary. Dictionary encoding is effective for columns with low cardinality. If the column values of a given row set cannot be compressed because the number of unique values is too high, Kudu transparently falls back to plain encoding for that row set. This is evaluated during flush.
- Prefix Encoding: Common prefixes in consecutive column values are compressed. Prefix encoding can be effective for values that share common prefixes, or for the first column of the primary key, since rows are sorted by primary key within tablets.
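The ideas behind these encodings can be illustrated with a small Python sketch. It is a toy model only and does not reproduce Kudu's on-disk formats: the function names, the `max_dict_size` threshold, and the `width` parameter are assumptions made for this example, and the LZ4 step of bitshuffle encoding is left out.

```python
from itertools import groupby
from os.path import commonprefix


def bit_transpose(values, width=8):
    """Bitshuffle idea: emit one bit plane per bit position, most significant first.

    Each plane is returned as a string of '0'/'1' characters, one per value.
    """
    return [
        "".join(str((v >> bit) & 1) for v in values)
        for bit in range(width - 1, -1, -1)
    ]


def run_length_encode(values):
    """Run length idea: store each run of equal values as a (value, count) pair."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def dictionary_encode(values, max_dict_size=4):
    """Dictionary idea: replace each value with its index in a list of unique values.

    Returns None when more than max_dict_size unique values are seen. The
    threshold is hypothetical; None stands in for falling back to plain encoding.
    """
    dictionary = []
    index_of = {}
    indexes = []
    for value in values:
        if value not in index_of:
            if len(dictionary) == max_dict_size:
                return None
            index_of[value] = len(dictionary)
            dictionary.append(value)
        indexes.append(index_of[value])
    return dictionary, indexes


def prefix_encode(sorted_values):
    """Prefix idea: store (length of prefix shared with previous value, suffix)."""
    encoded = []
    previous = ""
    for value in sorted_values:
        shared = len(commonprefix([previous, value]))
        encoded.append((shared, value[shared:]))
        previous = value
    return encoded


# Values that differ by small amounts share constant high-order bit planes.
print(bit_transpose([4, 5, 6, 7], width=4))
print(run_length_encode([7, 7, 7, 2, 2, 9]))
print(dictionary_encode(["us", "de", "us", "us", "fr"]))
print(prefix_encode(["host-01", "host-02", "host-10"]))
```

In the first call, the two high-order planes come out as all zeros and all ones, which is the kind of regularity a general-purpose compressor such as LZ4 handles well. The last call shows why prefix encoding suits sorted key columns: neighbouring values share long prefixes, so only short suffixes need to be stored.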
Each column in a Kudu table can be created with an encoding, based on the type of the column. Starting with Kudu 1.3, default encodings are specific to each column type.
Column Type | Encoding | Default |
---|---|---|