Late Materialization of Columns
This is an optimization feature available in Impala and is implemented to optimize queries that reference a large number of columns and filter out a significant portion of the rows. This currently works only for Parquet tables.
When a query is processed, any fields referenced in the select list are decoded from file storage and materialized into row values in memory to be processed for joins, predicates, and so on. Prior to this optimization, the materialization is applied to all the row values, regardless of whether they are eliminated by predicates. However with late materialization, as the name implies, only fields that are referenced by predicates and rows that are not filtered out by predicates are fully materialized.
When a select
* query is run over a 4 billion row table that returns a single row, it may take
~30 seconds to execute, however it may take less than 3 seconds if the query is replaced by
select field1, field2 ...
. This is because the select
* query is materializing all the fields
for rows, regardless of whether they match or not. And in the case of the select field1, field2
query only the columns that are required for filtering the data need to be materialized.
Use the configuration parameter (query option)
parquet_late_materialization_threshold
to provide the minimum number of
consecutive rows that are filtered out to avoid materialization. If set to less than 0, it
disables the late materialization. In most cases the default setting of 20 will provide optimal
performance.