DDL Bucketed Tables

To enhance query performance, create bucketed tables for the purpose of dividing data into smaller, more manageable chunks based on specified columns using the CLUSTER BY clause.

Syntax

Example 1: Bucketed table based on user_id

In the first example, the data is bucketed solely based on the user_id column.

CREATE TABLE tbl(
   user_id BIGINT,
   firstname STRING, 
   lastname STRING
)
CLUSTERED BY(user_id) INTO 256 BUCKETS;

Example 2: Bucketed table based on user_id and sorted by firstname

In the second example, in addition to bucketing by user_id, the data is also sorted within each bucket based on the firstname column. The bucket partitioning algorithm employs a hash function to evenly distribute rows across the table, aiding in optimizing query performance.

CREATE TABLE tbl2 (
  user_id BIGINT,
  firstname STRING,
  lastname STRING
)
CLUSTERED BY (user_id) SORT BY (firstname) INTO 256 BUCKETS;

Limitations

CREATE bucketed table statement is currently not supported on Kudu and Iceberg tables.
Alter operations such as add/drop/change/replace columns on bucketed tables are also not supported.
Impala currently does not leverage the bucket info to optimize queries.