Partition Pruning for Queries

Partition pruning refers to the mechanism where a query can skip reading the data files corresponding to one or more partitions.

If you can arrange for queries to prune large numbers of unnecessary partitions from the query execution plan, the queries use fewer resources and are thus proportionally faster and more scalable.

For example, if a table is partitioned by columns YEAR, MONTH, and DAY, then WHERE clauses such as WHERE year = 2013, WHERE year < 2010, or WHERE year BETWEEN 1995 AND 1998 allow Impala to skip the data files in all partitions outside the specified range. Likewise, WHERE year = 2013 AND month BETWEEN 1 AND 3 could prune even more partitions, reading the data files for only a portion of one year.

If a view applies to a partitioned table, any partition pruning considers the clauses on both the original query and any additional WHERE predicates in the query that refers to the view.

In queries involving both analytic functions and partitioned tables, partition pruning only occurs for columns named in the PARTITION BY clause of the analytic function call. For example, if an analytic function query has a clause such as WHERE year=2016, the way to make the query prune all other YEAR partitions is to include PARTITION BY year in the analytic function call; for example, OVER (PARTITION BY year,other_columns other_analytic_clauses).

Checking if Partition Pruning Happens for a Query

To check the effectiveness of partition pruning for a query, check the EXPLAIN output for the query before running it. For example, this example shows a table with 3 partitions, where the query only reads 1 of them. The notation #partitions=1/3 in the EXPLAIN plan confirms that Impala can do the appropriate partition pruning.

[localhost:21000] > INSERT INTO census PARTITION (year=2010) VALUES ('Smith'),('Jones');
[localhost:21000] > INSERT INTO census PARTITION (year=2011) VALUES ('Smith'),('Jones'),('Doe');
[localhost:21000] > INSERT INTO census PARTITION (year=2012) VALUES ('Smith'),('Doe');
[localhost:21000] > EXPLAIN select name from census where year=2010;
+------------------------------------------------------------------+
| Explain String                                                   |
+------------------------------------------------------------------+
| PLAN FRAGMENT 1                                                  |
|   PARTITION: RANDOM                                              |
|                                                                  |
|   STREAM DATA SINK                                               |
|     EXCHANGE ID: 1                                               |
|     UNPARTITIONED                                                |
|                                                                  |
|   0:SCAN HDFS                                                    |
|      table=predicate_propagation.census #partitions=1/3 size=12B |
+------------------------------------------------------------------+

For a report of the volume of data that was actually read and processed at each stage of the query, check the output of the SUMMARY command immediately after running the query. For a more detailed analysis, look at the output of the PROFILE command; it includes this same summary report near the start of the profile output.

Predicate Propagation

Impala can do partition pruning in cases where the partition key column is not directly compared to a constant. Using the predicate propagation technique, Impala applies the transitive property to other parts of the WHERE clause.

In this example, the census table includes another column indicating when the data was collected, which happens in 10-year intervals. Even though the query does not compare the partition key column (YEAR) to a constant value, Impala can deduce that only the partition YEAR=2010 is required, and again only reads 1 out of 3 partitions.

[localhost:21000] > CREATE TABLE census (name STRING, census_year INT) PARTITIONED BY (year INT);
[localhost:21000] > INSERT INTO census PARTITION (year=2010) VALUES  ('Smith',2010),('Jones',2010);
[localhost:21000] > INSERT INTO census PARTITION (year=2011) VALUES  ('Smith',2020),('Jones',2020),('Doe',2020);
[localhost:21000] > INSERT INTO census PARTITION (year=2012) VALUES  ('Smith',2020),('Doe',2020);
[localhost:21000] > 
[localhost:21000] > EXPLAIN select name from census where year = census_year and census_year=2010;
+------------------------------------------------------------------+
| Explain String                                                   |
+------------------------------------------------------------------+
| PLAN FRAGMENT 1                                                  |
|   PARTITION: RANDOM                                              |
|                                                                  |
|   STREAM DATA SINK                                               |
|     EXCHANGE ID: 1                                               |
|     UNPARTITIONED                                                |
|                                                                  |
|   0:SCAN HDFS                                                    |
|      table=predicate_propagation.census #partitions=1/3 size=22B |
|      predicates: census_year = 2010, year = census_year          |
+------------------------------------------------------------------+

Dynamic Partition Pruning

Impala supports two types of partition pruning.

  • Static partition pruning: The conditions in the WHERE clause are analyzed to determine in advance which partitions can be safely skipped.
  • Dynamic partition pruning: Information about the partitions is collected during the query execution, and Impala prunes unnecessary partitions. The information is not available in advance before runtime.

For example, if partition key columns are compared to literal values in a WHERE clause, Impala can perform static partition pruning during the planning phase to only read the relevant partitions:


-- The query only needs to read 3 partitions whose key values are known ahead of time.
-- That's static partition pruning.
SELECT COUNT(*) FROM sales_table WHERE year IN (2005, 2010, 2015);

Dynamic partition pruning involves using information only available at run time, such as the result of a subquery. The following example shows a simple dynamic partition pruning.


CREATE TABLE yy (s STRING) PARTITIONED BY (year INT);
INSERT INTO yy PARTITION (year) VALUES ('1999', 1999), ('2000', 2000),
  ('2001', 2001), ('2010', 2010), ('2018', 2018);
COMPUTE STATS yy;

CREATE TABLE yy2 (s STRING, year INT);
INSERT INTO yy2 VALUES ('1999', 1999), ('2000', 2000), ('2001', 2001);
COMPUTE STATS yy2;

-- The following query reads an unknown number of partitions, whose key values
-- are only known at run time. The runtime filters line shows the
-- information used in query fragment 02 to decide which partitions to skip.

EXPLAIN SELECT s FROM yy WHERE year IN (SELECT year FROM yy2);
+--------------------------------------------------------------------------+
| PLAN-ROOT SINK                                                           |
| |                                                                        |
| 04:EXCHANGE [UNPARTITIONED]                                              |
| |                                                                        |
| 02:HASH JOIN [LEFT SEMI JOIN, BROADCAST]                                 |
| |  hash predicates: year = year                                          |
| |  runtime filters: RF000 <- year                                   |
| |                                                                        |
| |--03:EXCHANGE [BROADCAST]                                               |
| |  |                                                                     |
| |  01:SCAN HDFS [default.yy2]                                            |
| |     partitions=1/1 files=1 size=620B                                   |
| |                                                                        |
| 00:SCAN HDFS [default.yy]                                                |
|    partitions=5/5 files=5 size=1.71KB                               |
|    runtime filters: RF000 -> year                                        |
+--------------------------------------------------------------------------+

SELECT s FROM yy WHERE year IN (SELECT year FROM yy2); -- Returns 3 rows from yy
PROFILE;

In the above example, Impala evaluates the subquery, sends the subquery results to all Impala nodes participating in the query, and then each impalad daemon uses the dynamic partition pruning optimization to read only the partitions with the relevant key values.

The output query plan from the EXPLAIN statement shows that runtime filters are enabled. The plan also shows that it expects to read all 5 partitions of the yy table, indicating that static partition pruning will not happen.

The Filter summary in the PROFILE output shows that the scan node filtered out based on a runtime filter of dynamic partition pruning.

Filter 0 (1.00 MB):
 - Files processed: 3
 - Files rejected: 1 (1)
 - Files total: 3 (3)

Dynamic partition pruning is especially effective for queries involving joins of several large partitioned tables. Evaluating the ON clauses of the join predicates might normally require reading data from all partitions of certain tables. If the WHERE clauses of the query refer to the partition key columns, Impala can now often skip reading many of the partitions while evaluating the ON clauses. The dynamic partition pruning optimization reduces the amount of I/O and the amount of intermediate data stored and transmitted across the network during the query.

Dynamic partition pruning is part of the runtime filtering feature, which applies to other kinds of queries in addition to queries against partitioned tables.