-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rewrite pruning logic in terms of PruningStatistics using Array trait (option 2) #426
Conversation
@@ -52,26 +44,81 @@ use crate::{ | |||
physical_plan::{planner::DefaultPhysicalPlanner, ColumnarValue, PhysicalExpr}, | |||
}; | |||
|
|||
/// Interface to pass statistics information to [`PruningPredicates`] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is the alternate proposed API based off of @Dandandan and @jorgecarleitao 's comments here: #380 (comment)
FYI @NGA-TRAN |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yap, makes a lot of sense and follows the arrow spirit much better ^_^
I will review this more carefully, but the design is really nice 👍
// column either did't have statistics at all or didn't have min/max values | ||
maybe_scalar.unwrap_or_else(|| null_scalar.clone()) | ||
}) | ||
.collect(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Collecting to Vec
might not be necessary here, we could maybe provide it to ScalarValue::iter_to_array
directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to avoid the collect() but I couldn't get Rust to stop complaining about returning a reference to a local value :(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I might have had the same problem here: #339
I changed ScalarValue::iter_to_array to accept ScalarValue
instead of &ScalarValue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense -- I was trying so hard to avoid having to pass in the owned ScalarValue
-- and instead it got even worse -- callers have to collect()
a bunch of owned ones all together!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be possible now that the other PR is merged!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now, sadly, when I try (in 86f8079) to remove the collect I get the following panic :(
---- physical_plan::parquet::tests::row_group_predicate_builder_unsupported_type stdout ----
thread 'physical_plan::parquet::tests::row_group_predicate_builder_unsupported_type' panicked at 'Iterator must be sized', /Users/alamb/.cargo/registry/src/github.com-1ecc6299db9ec823/arrow-4.1.0/src/array/array_boolean.rs:168:33
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm that's sad... I think there should be somewhere a violation somewhere of using the trusted length iterator (without requiring unsafe
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree -- 😢 🐼
|
||
let num_containers = statistics.num_containers(); | ||
|
||
let array = match statistics_type { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So here is the core difference for changing it to array - less work here is needed when the statistics are loaded.
Trade off there makes sense for me, at least in cases when we can keep the statistics this should be beneficial.
Yes, I am happy with how this turned out (thanks to @Dandandan for suggesting the change 💯 ) |
Codecov Report
@@ Coverage Diff @@
## master #426 +/- ##
==========================================
+ Coverage 74.84% 75.37% +0.52%
==========================================
Files 146 147 +1
Lines 24515 24810 +295
==========================================
+ Hits 18349 18700 +351
+ Misses 6166 6110 -56
Continue to review full report at Codecov.
|
1df3f32
to
ef966f3
Compare
Closes #363
Note: This is an alternate approach to #380, providing statistics using
Array
s rather thanScalarValue
s. Only one of this or #380 should be mergedRationale
As explained on #363 the high level goal is to make the parquet row group pruning logic generic to any types of min/max statistics (not just parquet metadata)
Changes:
PruningStatistics
traitPruningPredicateBuilder
to be generic in terms ofPruningStatistics
Example
Here is a brief snippet of one of the tests that shows the new API:
Sequence:
I am trying to do this in a few small PRs to reduce review burden; Here is how connect together:
Planned changes:
Fn
#370)ScalarValue::iter_to_array
(Function to createArrayRef
from an iterator of ScalarValues #381)PruningStatstics
Trait (this PR)