Partition transform feature

From Hive or Impala, you can use one or more partition transforms to partition your data. Each transform is applied to a single column. Identity-transform means no transformation; the column values are used for partitioning. The other transforms apply a function to the column values and the data is partitioned by the transformed values.

Using CREATE TABLE ... PARTITIONED BY you create identity-partitioned Iceberg tables. Identity-partitioned Iceberg tables are similar to the Hive or Impala partitioned tables, which are stored in the same directory structure as the data files. Iceberg stores the partitioning columns of identity-partitioned Iceberg tables in a different directory structure from the data files if the tables are migrated to Iceberg from Hive external tables. Iceberg handles the tables and files regardless of the location.

Hive and Impala support Iceberg advanced partitioning through the PARTITION BY SPEC clause. Using this clause, you can define the Iceberg partition fields and partition transforms.

The following table shows the available partition transforms and their corresponding transform specifications:


Partition transform	Transform specification	Supported SQL engine
Partition by year	years(time_stamp) \| year(time_stamp)	Hive and Impala
Partition by month	months(time_stamp) \| month(time_stamp)	Hive and Impala
Partition by a date value stored as int (dateint)	days(time_stamp) \| date(time_stamp)	Hive
Partition by hours	hours(time_stamp)	Hive
Partition by a dateint in hours	date_hour(time_stamp)	Hive
Partition by hashed value mod N buckets	bucket(N, col)	Hive and Impala
Partition by value truncated to L, which is a number of characters	truncate(L, col)	Hive and Impala
Partition by hashed value mod N buckets	bucket(N, col)	Hive and Impala
Partition by value truncated to L, which is a number of characters	truncate(L, col)	Hive and Impala
Partition by value	identity(col)	Impala
Partition by null values	void(col)	Impala

The BUCKET, TRUNCATE, IDENTITY, and VOID transforms now support the BINARY type.

The VOID transform might be used to replace the transform in an existing partition field so that the field is effectively dropped in v1 tables.

Strings are truncated to length L. Integers and longs are truncated to bins. For example, truncate(10, i) yields partitions 0, 10, 20, 30 …

The idea behind transformation partition by hashed value mod N buckets is the same as hash bucketing for Hive tables. A hashing algorithm calculates the bucketed column value (modulus). For example, for 10 buckets, data is stored in column value % 10, ranging from 0-9 (0 to n-1) buckets.

You use the PARTITIONED BY SPEC clause to partition a table by an identity transform.

Hive syntax

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name	  
  [(col_name data_type][, time_stamp TIMESTAMP] )]
  [PARTITIONED BY SPEC([col_name][, spec(value)][, spec(value)]...)]
  [STORED AS file_format]
   STORED BY ICEBERG
  [TBLPROPERTIES (property_name=property_value, ...)]

Where spec (value) represents one or more of the following transforms:

YEARS(col_name)
MONTHS(col_name)
DAYS(col_name)
BUCKET(bucket_num,col_name)
TRUNCATE(length, col_name)

Impala syntax

CREATE TABLE [IF NOT EXISTS] [db_name.]table_name	  
  [(col_name data_type, ... )]
  [PARTITIONED BY SPEC([col_name][, spec(value)][, spec(value)]...)]
   STORED (AS | BY) ICEBERG
  [TBLPROPERTIES (property_name=property_value, ...)]

Where spec(value) represents one or more of the following transforms:

YEARS(col_name)
MONTHS(col_name)
DAYS(col_name)
BUCKET(bucket_num,col_name)
TRUNCATE(length, col_name)

Hive example

The following example creates a top level partition based on column i, a second level partition based on the hour part of the timestamp, and a third level partition based on the first 1000 characters in column j.

CREATE EXTERNAL TABLE ice_3 (i INT, t TIMESTAMP, j BIGINT) PARTITIONED BY SPEC (i, HOUR(t), TRUNCATE(1000, j)) STORED BY ICEBERG;

Impala examples

CREATE TABLE ice_13 (i INT, t TIMESTAMP, j BIGINT) PARTITIONED BY SPEC (i, HOUR(t), TRUNCATE(1000, j)) STORED BY ICEBERG;

The following examples show how to use the PARTITION BY SPEC clause in a CREATE TABLE query from Impala.The same transforms are available in a CREATE EXTERNAL TABLE query from Hive.

CREATE TABLE ice_t(id INT, name STRING, dept STRING)
PARTITIONED BY SPEC (bucket(19, id), dept)
STORED BY ICEBERG
TBLPROPERTIES ('format-version'='2');

CREATE TABLE ice_ctas
PARTITIONED BY SPEC (truncate(1000, id))
STORED BY ICEBERG
TBLPROPERTIES ('format-version'='2')
AS SELECT id, int_col, string_col FROM source_table;