Support general pruning based on `<col> = 'const'` in `PruningPredicate` #8376

alamb · 2023-11-30T16:49:10Z

Is your feature request related to a problem or challenge?

In IOx in certain cases we know that a "container" (parquet file, or set of record batches) has only a single value for some computed quantity (in our case hash(column) % N for some constant N (like 100).

For this case, we want to be able to quickly determine, given an arbitrary predicate, if that container could not possible contain the value.

So for example, if we know the container has hash(column) % 100 = 27, given a predicate that includes an expression like column = 'foo', we can compute the quantity hash(column) % 100 and if it is not 27 we can skip the entire container.

We could implement this directly in our codebase, and we may do so temporarily. However, I think the usecase is common enough that we would like to improve the support upstream in DataFusion so others can both benefit and help optimize for it.

For example, applying BloomFilters to prune out parquet row groups (added in #7821 by @hengfeiyang) has the same pattern.

Since we also have other information such as min/max and null counts for certain columns that we prune using PruningPredicate, having this ability be part of PruningPredicate is compelling

Describe the solution you'd like

DataFusion's PruningPredicate can already use information on ranges (min/max values). I would like to extend its capabilities to incorporate knowledge about certain specific values to take advantage of column = <constant> predicates.

I propose we extend PruningStatistics with some way to pass knowledge on about the contents of data structures like Bloom Filters. For example:

pub trait PruningStatistics {
    // PROPOSED ADDITIONS

    // Returns an array where each element is
    // * `true` if the value of column CERTAINLY DOES contain `value`
    // * `false` if the value of column CERTAINLY DOES NOT contain `value`
    // * `null` if the value of column may or may not contain the value
    fn contains(&self, column: &Column, value: &ScalarValue) -> Result<Option<BooleanArray>>;

    // EXISTING METHODS
    fn min_values(&self, column: &Column) -> Option<ArrayRef>;
    fn max_values(&self, column: &Column) -> Option<ArrayRef>;
    fn num_containers(&self) -> usize;
    fn null_counts(&self, column: &Column) -> Option<ArrayRef>;
}

We could then implement the bloom filter pruning in DataFusion with this API as well as use the same thing for our downstream usecase

Example for equality predicate `col = 'foo'`

The PruningPredicate would call

PruningStatistics::contains(Column(col), 'foo')

and could prune all containers that returned false

Example for inequality predicate `col != 'foo'`

The PruningPredicate would call

PruningStatistics::contains(Column(col), 'foo')

and could prune all containers that returned true

Note I don't think the contains API could be used for other inequality predicates like col < 'foo' for example. The existing min/max statistics would have to be used

Describe alternatives you've considered

I also thought about trying to rewrite equality predicates to take advantage of the existing min/max statistics (which can represent where a column has only a single value). However, that API doesn't allow for information like Bloom filters which simply can say for sure if the value may be present or not.

Additional context

Here is the code that does bloom filtering: https://github.com/apache/arrow-datafusion/blob/2a692446f46ef96f48eb9ba19231e9576be9ff5a/datafusion/core/src/datasource/physical_plan/parquet/row_groups.rs#L203-L252

The text was updated successfully, but these errors were encountered:

alamb · 2023-12-11T14:20:47Z

@waynexia / @haohuaijin I wonder if you have time to review the PRs that are part of this issue (would make parquet statistics pruning more effective, among other things):

Add LiteralGuarantee on columns to extract conditions required for PhysicalExpr expressions to evaluate to true #8437
Implement contained API in PruningPredicate #8440
POC Make BloomFilter application general, add PruningPredicate::contains #8397

cc @Dandandan and @thinkharderdev as I think you use this API as well and perhaps may have an interest as well

alamb · 2023-12-23T12:47:59Z

Technically speaking this was closed with #8440

#8442 is a cleanup but I am going to close this ticket as done!

alamb added the enhancement New feature or request label Nov 30, 2023

alamb mentioned this issue Nov 30, 2023

Epic: Statistics improvements #8227

Open

19 tasks

alamb self-assigned this Nov 30, 2023

alamb mentioned this issue Dec 11, 2023

DataFusion weekly project plan (Andrew Lamb) - Dec 11, 2023 #8490

Closed

8 tasks

alamb mentioned this issue Dec 18, 2023

DataFusion weekly project plan (Andrew Lamb) - Dec 18, 2023 #8577

Closed

7 tasks

alamb closed this as completed Dec 23, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support general pruning based on `<col> = 'const'` in `PruningPredicate` #8376

Support general pruning based on `<col> = 'const'` in `PruningPredicate` #8376

alamb commented Nov 30, 2023 •

edited

Loading

alamb commented Dec 11, 2023

alamb commented Dec 23, 2023

Support general pruning based on <col> = 'const' in PruningPredicate #8376

Support general pruning based on <col> = 'const' in PruningPredicate #8376

Comments

alamb commented Nov 30, 2023 • edited Loading

Is your feature request related to a problem or challenge?

Describe the solution you'd like

Example for equality predicate col = 'foo'

Example for inequality predicate col != 'foo'

Describe alternatives you've considered

Additional context

alamb commented Dec 11, 2023

alamb commented Dec 23, 2023

Support general pruning based on `<col> = 'const'` in `PruningPredicate` #8376

Support general pruning based on `<col> = 'const'` in `PruningPredicate` #8376

alamb commented Nov 30, 2023 •

edited

Loading

Example for equality predicate `col = 'foo'`

Example for inequality predicate `col != 'foo'`