Compacting partitions
How to change your partition strategy with the least amount of disruption.
By: Balazs Jeszenszky, Field Engineer, Cloudera, Inc.
When reducing the number of partitions it is best to retain as much
benefit from the original partitioning strategy as possible. This might not
be a trivial task if you cannot influence how end-user queries are written,
which might use the original partitioning scheme. For example, a table that
used to have daily partitions with a key of the string
data type, for
example, ‘1970-01-01,’ might have to be compressed into yearly partitions,
which are stored as integer
data types, such as 1970, to arrive at an
acceptable file and partition size:
CREATE TABLE _OLD (c1 int) PARTITIONED BY (_day string);
CREATE TABLE _NEW (c1 int, _day string) PARTITIONED BY (year int);
The non-partition _day
column is added to avoid losing
data. To retain partition pruning for queries that filter on _day
,
you can define a view:
CREATE VIEW _NEW_VIEW AS SELECT * FROM _NEW WHERE year = year(_day);
When queries run against this view their _day
predicate
is translated to a partition filter. The year
built-in UDF
is useful in this case, but others, such as substr
also work.