Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite pruning logic in terms of PruningStatistics using Array trait (option 2) #426

Merged
merged 4 commits into from
May 28, 2021

Conversation

alamb
Copy link
Contributor

@alamb alamb commented May 25, 2021

Closes #363

Note: This is an alternate approach to #380, providing statistics using Arrays rather than ScalarValues. Only one of this or #380 should be merged

Rationale

As explained on #363 the high level goal is to make the parquet row group pruning logic generic to any types of min/max statistics (not just parquet metadata)

Changes:

  1. Introduce a new PruningStatistics trait
  2. Refactor PruningPredicateBuilder to be generic in terms of PruningStatistics
  3. Add documentation and tests

Example

Here is a brief snippet of one of the tests that shows the new API:

        // Prune using s2 > 5
        let expr = col("s2").gt(lit(5));

        let statistics = TestStatistics::new().with(
            "s2",
            ContainerStats::new_i32(
                vec![Some(0), Some(4), None, Some(3)], // min
                vec![Some(5), Some(6), None, None],    // max
            ),
        );

        // s2 [0, 5] ==> no rows should pass
        // s2 [4, 6] ==> some rows could pass
        // No stats for s2 ==> some rows could pass
        // s2 [3, None] (null max) ==> some rows could pass

        let p = PruningPredicate::try_new(&expr, schema).unwrap();
        let result = p.prune(&statistics).unwrap();
        let expected = vec![false, true, true, true];

        assert_eq!(result, expected);
    }
}

Sequence:

I am trying to do this in a few small PRs to reduce review burden; Here is how connect together:

Planned changes:

@@ -52,26 +44,81 @@ use crate::{
physical_plan::{planner::DefaultPhysicalPlanner, ColumnarValue, PhysicalExpr},
};

/// Interface to pass statistics information to [`PruningPredicates`]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is the alternate proposed API based off of @Dandandan and @jorgecarleitao 's comments here: #380 (comment)

@alamb
Copy link
Contributor Author

alamb commented May 25, 2021

FYI @NGA-TRAN

Copy link
Member

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yap, makes a lot of sense and follows the arrow spirit much better ^_^

I will review this more carefully, but the design is really nice 👍

// column either did't have statistics at all or didn't have min/max values
maybe_scalar.unwrap_or_else(|| null_scalar.clone())
})
.collect();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Collecting to Vec might not be necessary here, we could maybe provide it to ScalarValue::iter_to_array directly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to avoid the collect() but I couldn't get Rust to stop complaining about returning a reference to a local value :(

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I might have had the same problem here: #339

I changed ScalarValue::iter_to_array to accept ScalarValue instead of &ScalarValue

https://github.com/apache/arrow-datafusion/pull/339/files#diff-89202db09c6a169c91c1f7ec44915cf3e61e738e59928eedac7bc7d578a4051fR327

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense -- I was trying so hard to avoid having to pass in the owned ScalarValue -- and instead it got even worse -- callers have to collect() a bunch of owned ones all together!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be possible now that the other PR is merged!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now, sadly, when I try (in 86f8079) to remove the collect I get the following panic :(

---- physical_plan::parquet::tests::row_group_predicate_builder_unsupported_type stdout ----
thread 'physical_plan::parquet::tests::row_group_predicate_builder_unsupported_type' panicked at 'Iterator must be sized', /Users/alamb/.cargo/registry/src/github.com-1ecc6299db9ec823/arrow-4.1.0/src/array/array_boolean.rs:168:33

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm that's sad... I think there should be somewhere a violation somewhere of using the trusted length iterator (without requiring unsafe).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree -- 😢 🐼


let num_containers = statistics.num_containers();

let array = match statistics_type {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So here is the core difference for changing it to array - less work here is needed when the statistics are loaded.
Trade off there makes sense for me, at least in cases when we can keep the statistics this should be beneficial.

@alamb
Copy link
Contributor Author

alamb commented May 25, 2021

Yap, makes a lot of sense and follows the arrow spirit much better ^_^

Yes, I am happy with how this turned out (thanks to @Dandandan for suggesting the change 💯 )

@codecov-commenter
Copy link

Codecov Report

Merging #426 (1e88763) into master (ea59d05) will increase coverage by 0.52%.
The diff coverage is 83.62%.

❗ Current head 1e88763 differs from pull request most recent head 1df3f32. Consider uploading reports for the commit 1df3f32 to get more accurate results
Impacted file tree graph

@@            Coverage Diff             @@
##           master     #426      +/-   ##
==========================================
+ Coverage   74.84%   75.37%   +0.52%     
==========================================
  Files         146      147       +1     
  Lines       24515    24810     +295     
==========================================
+ Hits        18349    18700     +351     
+ Misses       6166     6110      -56     
Impacted Files Coverage Δ
datafusion/src/physical_plan/expressions/mod.rs 71.42% <ø> (ø)
datafusion/src/physical_plan/hash_aggregate.rs 85.21% <ø> (ø)
datafusion/src/physical_plan/sort.rs 91.26% <56.25%> (-0.82%) ⬇️
datafusion/src/physical_plan/window_functions.rs 85.71% <57.89%> (-3.01%) ⬇️
datafusion/src/physical_plan/mod.rs 78.70% <61.90%> (-4.06%) ⬇️
datafusion/src/physical_plan/windows.rs 73.78% <76.57%> (+73.78%) ⬆️
...fusion/src/physical_plan/expressions/row_number.rs 81.25% <81.25%> (ø)
datafusion/src/physical_plan/parquet.rs 82.19% <91.66%> (+0.42%) ⬆️
datafusion/src/optimizer/constant_folding.rs 91.69% <92.10%> (+0.05%) ⬆️
datafusion/src/physical_optimizer/pruning.rs 90.08% <92.64%> (+0.34%) ⬆️
... and 19 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ea59d05...1df3f32. Read the comment docs.

@alamb alamb force-pushed the alamb/generic_pruning_input2 branch from 1df3f32 to ef966f3 Compare May 27, 2021 10:29
@alamb alamb merged commit 7007f8e into apache:master May 28, 2021
@alamb alamb deleted the alamb/generic_pruning_input2 branch May 28, 2021 09:56
@houqp houqp added api change Changes the API exposed to users of the crate datafusion Changes in the datafusion crate enhancement New feature or request labels Jul 29, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api change Changes the API exposed to users of the crate datafusion Changes in the datafusion crate enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Reusable "row group pruning" logic
6 participants