Partitioning considerations
It is useful to incorporate partition pruning into your partitioning strategy.
By: Balazs Jeszenszky, Field Engineer, Cloudera, Inc.
Organizing data sets with partitioning enables you to perform partition pruning. Partition pruning enables queries to scan only a subset of the table data. Generally, this is beneficial, but creating partitions that are too small can create a small files problem, which can negatively impact performance. As a general rule, partitions that are under 1 GB or that contain only 4 files are usually not worth creating.
For what size of table should you consider partitioning? As an example, if you have an unpartitioned table with 15 GBs of data that resides in 60 files, which are 256 MBs each and spread evenly across a 60-node HDFS cluster, a full table scan reads 256 MBs per node. Partitioning this table into 15 equal partitions and filtering to a single partition leads to less data nodes participating in the scan, but each of them still must read 256 MBs of data. This means less workload is spread across the cluster, but involves roughly the same run time. This example represents an optimal case, indicating that it might be beneficial to partition even smaller tables. However, unless you have a specific request for partitioning, one file per node is a good threshold for when to consider partitioning a table.
Partition pruning only works if the queries that are run against a table
have a predicate on the partition column. Because of this, partitioning
even a large table is only helpful if the queries against that table are
understood. For example, if a partitioning strategy is based on the date
column, but most of the queries that are run against the table do not
include predicates for date
, a full table scan will still be necessary and
the partitioning will not provide any benefit. Also note that dynamic
partition pruning can be helpful. Dynamic partition pruning enables
partition pruning at runtime if the partitions are part of a join key. For
more information about dynamic partition pruning, see Runtime Filtering for Impala Queries (CDH 5.7 or higher
only).
To refresh data in partitioned tables, use the per-partition REFRESH
statement to pick up new data in an already existing partition. Use
ALTER TABLE [...] RECOVER PARTITIONS
to pick up new
partitions and their data. These are the most lightweight statements for
picking up new data.
It is a good idea to review your partitioning and table strategy when a table reaches 50,000 partitions. Make sure you implement your new strategies before a table reaches 100,000 partitions. These are not hard-and-fast rules on what the system can handle, rather a best practice that helps you avoid major issues.