Compacting partitions

How to change your partition strategy with the least amount of disruption.

By: Balazs Jeszenszky, Field Engineer, Cloudera, Inc.

When reducing the number of partitions it is best to retain as much benefit from the original partitioning strategy as possible. This might not be a trivial task if you cannot influence how end-user queries are written, which might use the original partitioning scheme. For example, a table that used to have daily partitions with a key of the string data type, for example, ‘1970-01-01,’ might have to be compressed into yearly partitions, which are stored as integer data types, such as 1970, to arrive at an acceptable file and partition size:

CREATE TABLE _OLD (c1 int) PARTITIONED BY (_day string);
CREATE TABLE _NEW (c1 int, _day string) PARTITIONED BY (year int);

The non-partition _day column is added to avoid losing data. To retain partition pruning for queries that filter on _day, you can define a view:

CREATE VIEW _NEW_VIEW AS SELECT * FROM _NEW WHERE year = year(_day);

When queries run against this view their _day predicate is translated to a partition filter. The year built-in UDF is useful in this case, but others, such as substr also work.