Skip Scheduling Bloom Filter

As a performance optimization, scheduling of a bloom filter can be skipped for a join node that has certain characteristics.

PK-FK joins between a dimension table and a fact table are common in queries. Such joins often do not involve any predicate filters on the dimension table. As a result, a bloom filter generated from this kind of dimension table scan (PK) will most likely contain all values from the fact table column (FK). Generating this filter is ineffective because it is unlikely to reject any rows, especially if the bloom filter is large and has a high false positive probability (FPP) estimate.
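
To illustrate why such a filter tends to be ineffective, the sketch below uses the classic bloom filter FPP estimate, (1 - e^(-k*n/m))^k. The number of hash functions (8), the 16 MiB filter size, and the function name estimated_fpp are assumptions chosen for illustration; the planner's actual estimate may be computed differently.

```python
import math


def estimated_fpp(ndv: int, filter_bits: int, num_hashes: int = 8) -> float:
    """Classic bloom filter false positive estimate: (1 - e^(-k*n/m))^k.

    ndv         -- number of distinct values inserted (n)
    filter_bits -- size of the filter in bits (m)
    num_hashes  -- number of hash functions (k); 8 is an assumed value
    """
    return (1.0 - math.exp(-num_hashes * ndv / filter_bits)) ** num_hashes


# A 16 MiB filter, picked only for illustration.
FILTER_BITS = 16 * 1024 * 1024 * 8

# One million distinct keys fit comfortably: the estimate is close to 0.
small_dim = estimated_fpp(1_000_000, FILTER_BITS)

# One billion distinct keys saturate the filter: the estimate is close to 1,
# so almost every probe row passes and the filter rejects nothing.
large_dim = estimated_fpp(1_000_000_000, FILTER_BITS)

print(f"small build side FPP: {small_dim:.3g}")
print(f"large build side FPP: {large_dim:.3g}")
```

Once the filter size is capped, a build side with many distinct values pushes the estimate toward 1, which is the situation this optimization targets.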

As part of this optimization, this release skips scheduling a bloom filter from a join node that has all of the following characteristics:
  • The build side is a full table scan and has a hard cardinality estimate.
  • The build scan neither has a predicate filter nor consumes a runtime filter.
  • The join node is assumed to have a PK-FK relationship.
  • The planned bloom filter has a resulting FPP estimate higher than the value of the max_filter_error_rate_from_full_scan flag (default 0.9).
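
The conditions listed above can be sketched as a single predicate. This is a minimal illustration, not the planner's implementation: the JoinFilterPlan class, its field names, and should_skip_bloom_filter are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass
class JoinFilterPlan:
    """Hypothetical summary of a planned bloom filter and its join node."""
    build_is_full_scan: bool             # full table scan with a hard estimate
    build_has_predicates: bool           # build scan has a predicate filter
    build_consumes_runtime_filter: bool  # build scan consumes a runtime filter
    assumed_pk_fk: bool                  # join is assumed to be PK-FK
    estimated_fpp: float                 # estimate after the size limit is applied


def should_skip_bloom_filter(plan: JoinFilterPlan,
                             max_error_rate_from_full_scan: float = 0.9) -> bool:
    """Return True only when every listed characteristic holds."""
    if max_error_rate_from_full_scan < 0:
        # A negative flag value disables the feature, so nothing is skipped.
        return False
    return (plan.build_is_full_scan
            and not plan.build_has_predicates
            and not plan.build_consumes_runtime_filter
            and plan.assumed_pk_fk
            and plan.estimated_fpp > max_error_rate_from_full_scan)


unfiltered_dim = JoinFilterPlan(True, False, False, True, estimated_fpp=0.97)
filtered_dim = JoinFilterPlan(True, True, False, True, estimated_fpp=0.97)

print(should_skip_bloom_filter(unfiltered_dim))  # all conditions hold
print(should_skip_bloom_filter(filtered_dim))    # build scan has a predicate
```

Because the conditions are combined with a logical AND, a single predicate on the dimension table is enough to keep the filter scheduled.
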
The following flag is added to control the generation of bloom filters:
max_filter_error_rate_from_full_scan = 0.9

This flag allows skipping the generation of a runtime bloom filter that is built from a full build-side scan and whose estimated error rate, after the filter size limit is applied, is higher than the value set in this flag. This setting may be ignored if a higher target error rate is set through the RUNTIME_FILTER_ERROR_RATE query option or the max_filter_error_rate backend flag. Setting max_filter_error_rate_from_full_scan to a value less than 0 disables this runtime filter reduction feature.
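
The interaction between these settings can be read in more than one way. The sketch below encodes one plausible interpretation, assuming that a higher target error rate simply takes the place of the flag value as the skip threshold. The function name effective_skip_threshold and this max-based rule are assumptions for illustration, not confirmed behavior.

```python
from typing import Optional


def effective_skip_threshold(max_error_rate_from_full_scan: float,
                             target_error_rate: float) -> Optional[float]:
    """Return the FPP threshold above which a filter would be skipped.

    Returns None when the feature is disabled (negative flag value).
    Assumption: a target error rate higher than the flag overrides it.
    """
    if max_error_rate_from_full_scan < 0:
        return None
    return max(max_error_rate_from_full_scan, target_error_rate)


# Flag at its default, target error rate below it: the flag value applies.
default_case = effective_skip_threshold(0.9, 0.75)

# Target error rate above the flag: the flag value is effectively ignored.
overridden_case = effective_skip_threshold(0.9, 0.95)

# Negative flag value: the feature is off and no threshold applies.
disabled_case = effective_skip_threshold(-1.0, 0.75)

print(default_case, overridden_case, disabled_case)
```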