Partitioning considerations

It is useful to incorporate partition pruning into your partitioning strategy.

By: Balazs Jeszenszky, Field Engineer, Cloudera, Inc.

Organizing data sets with partitioning enables you to perform partition pruning. Partition pruning enables queries to scan only a subset of the table data. Generally, this is beneficial, but creating partitions that are too small can create a small files problem, which can negatively impact performance. As a general rule, partitions that are under 1 GB or that contain only 4 files are usually not worth creating.

For what size of table should you consider partitioning? As an example, if you have an unpartitioned table with 15 GBs of data that resides in 60 files, which are 256 MBs each and spread evenly across a 60-node HDFS cluster, a full table scan reads 256 MBs per node. Partitioning this table into 15 equal partitions and filtering to a single partition leads to less data nodes participating in the scan, but each of them still must read 256 MBs of data. This means less workload is spread across the cluster, but involves roughly the same run time. This example represents an optimal case, indicating that it might be beneficial to partition even smaller tables. However, unless you have a specific request for partitioning, one file per node is a good threshold for when to consider partitioning a table.

Partition pruning only works if the queries that are run against a table have a predicate on the partition column. Because of this, partitioning even a large table is only helpful if the queries against that table are understood. For example, if a partitioning strategy is based on the date column, but most of the queries that are run against the table do not include predicates for date, a full table scan will still be necessary and the partitioning will not provide any benefit. Also note that dynamic partition pruning can be helpful. Dynamic partition pruning enables partition pruning at runtime if the partitions are part of a join key. For more information about dynamic partition pruning, see Runtime Filtering for Impala Queries (CDH 5.7 or higher only).

To refresh data in partitioned tables, use the per-partition REFRESH statement to pick up new data in an already existing partition. Use ALTER TABLE [...] RECOVER PARTITIONS to pick up new partitions and their data. These are the most lightweight statements for picking up new data.

It is a good idea to review your partitioning and table strategy when a table reaches 50,000 partitions. Make sure you implement your new strategies before a table reaches 100,000 partitions. These are not hard-and-fast rules on what the system can handle, rather a best practice that helps you avoid major issues.