Specialize filter for structs and sparse unions #6304
@@ -1871,4 +1937,75 @@ mod tests {
        }
    }
}
Union tests already exist.
(I double-checked, and they are immediately above this: fn test_filter_union_array_sparse() { ... }) 👍
arrow-select/src/filter.rs
let predicate = FilterBuilder::new(predicate).build();
let mut filter_builder = FilterBuilder::new(predicate);

fn multiple_arrays(data_type: &DataType) -> bool {
Could you move this function to the top level? I know Rust allows nested functions, but I think this feature confuses people more than it helps, especially since it does NOT create a closure, i.e. variable capture is not possible.
I agree, a top-level function would be better.
Done
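For anyone reading this thread later, here is a minimal sketch of the nested-function point in plain std Rust. The names (count_above, is_above, is_above_top_level) are hypothetical and are not taken from arrow-select; the sketch only illustrates that a nested fn is an ordinary item that cannot capture locals, whereas a closure can.

```rust
/// Hypothetical example function; not part of arrow-select.
fn count_above(values: &[i32], threshold: i32) -> usize {
    // A nested `fn` is an ordinary item: it cannot see `threshold`,
    // so everything it needs has to be passed in explicitly.
    fn is_above(value: i32, threshold: i32) -> bool {
        value > threshold
    }

    // A closure, by contrast, captures `threshold` from the enclosing scope.
    let is_above_closure = |value: i32| value > threshold;

    let by_fn = values.iter().filter(|v| is_above(**v, threshold)).count();
    let by_closure = values.iter().filter(|v| is_above_closure(**v)).count();
    assert_eq!(by_fn, by_closure);
    by_fn
}

/// The same helper hoisted to the top level, as suggested in the review.
/// Its signature is identical to the nested version, which is the point:
/// nesting bought no access to the caller's locals.
fn is_above_top_level(value: i32, threshold: i32) -> bool {
    value > threshold
}
```

Since the nested and top-level versions need the same parameters, hoisting the helper changes nothing about how it is called.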
}

if multiple_arrays(values.data_type()) {
    filter_builder = filter_builder.optimize();
What happens if you call optimize when the data type does NOT have multiple arrays? I'm wondering whether we really need this branch or whether we could optimize unconditionally. If "optimize" doesn't really optimize in all cases, maybe we should fix that?
I think the rationale is that "optimizing" the filter requires inspecting the BooleanArray, which itself takes non-trivial time, so for some operations the overhead of working out a better filter strategy exceeds the time spent actually running the filter.
This is basically the same algorithm used in filter_record_batch: https://docs.rs/arrow-select/52.2.0/src/arrow_select/filter.rs.html#179-183
Users who want to always optimize can use a FilterBuilder and explicitly call optimize: https://docs.rs/arrow/latest/arrow/compute/struct.FilterBuilder.html#method.optimize
I made a PR to try to clarify this in the docs: #6317
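To make the trade-off concrete, here is a rough model in plain std Rust. It is not the arrow-select implementation: DataType, Strategy, build and apply are simplified, hypothetical stand-ins, and the body of multiple_arrays is only my reading of the rule in the PR description (multi-column struct or non-fieldless sparse union). The idea it illustrates is that scanning the mask once to precompute indices only pays off when the result is reused across several child arrays.

```rust
/// Hypothetical, heavily simplified stand-in for arrow's `DataType`.
enum DataType {
    Int32,
    Struct(Vec<DataType>),
    SparseUnion(Vec<DataType>),
}

/// Assumed rule: true when the predicate will be applied to more than one
/// underlying array, so the up-front mask analysis can be shared.
fn multiple_arrays(data_type: &DataType) -> bool {
    match data_type {
        DataType::Struct(fields) => fields.len() > 1,
        DataType::SparseUnion(fields) => !fields.is_empty(),
        _ => false,
    }
}

/// Two ways of carrying the same selection.
enum Strategy {
    /// Walk the boolean mask on every call.
    Mask(Vec<bool>),
    /// Selected positions computed once and reused on every call.
    Indices(Vec<usize>),
}

/// Pay for the extra scan of the mask only when it will be reused.
fn build(mask: Vec<bool>, data_type: &DataType) -> Strategy {
    if multiple_arrays(data_type) {
        let indices = mask
            .iter()
            .enumerate()
            .filter(|(_, keep)| **keep)
            .map(|(i, _)| i)
            .collect();
        Strategy::Indices(indices)
    } else {
        Strategy::Mask(mask)
    }
}

/// Both strategies select the same values; only the work per call differs.
fn apply(strategy: &Strategy, values: &[i32]) -> Vec<i32> {
    match strategy {
        Strategy::Mask(mask) => values
            .iter()
            .zip(mask)
            .filter(|(_, keep)| **keep)
            .map(|(v, _)| *v)
            .collect(),
        Strategy::Indices(indices) => indices.iter().map(|&i| values[i]).collect(),
    }
}
```

In this model a single Int32 column keeps the plain mask, while a two-field struct gets the precomputed indices, and both produce identical output.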
Thanks @gstvg and @crepererum -- I agree with @crepererum's comments but otherwise this PR looks good to me.
@@ -815,6 +815,14 @@ pub trait AsArray: private::Sealed {
        self.as_struct_opt().expect("struct array")
    }

    /// Downcast this to a [`UnionArray`] returning `None` if not possible
👍
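As a side note on the pairing visible in this hunk (an _opt accessor returning an Option, plus a panicking accessor built on expect), here is a small sketch of the pattern using std::any::Any. UnionLike and AsUnionLike are hypothetical names; arrow's real AsArray trait is sealed and works on its own array types, so this models only the shape of the API, not its implementation.

```rust
use std::any::Any;

/// Hypothetical stand-in for a concrete array type.
struct UnionLike {
    type_ids: Vec<i8>,
}

/// Hypothetical trait mirroring the `as_x_opt` / `as_x` pairing.
trait AsUnionLike {
    /// Downcast, returning `None` if the value is not a `UnionLike`.
    fn as_union_opt(&self) -> Option<&UnionLike>;

    /// Downcast, panicking if the value is not a `UnionLike`.
    fn as_union(&self) -> &UnionLike {
        self.as_union_opt().expect("union array")
    }
}

impl AsUnionLike for dyn Any {
    fn as_union_opt(&self) -> Option<&UnionLike> {
        self.downcast_ref::<UnionLike>()
    }
}
```

Callers that already know the type can use the panicking form, while callers that are probing should prefer the _opt form and handle None.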
Thanks @crepererum and @alamb for the reviews, I applied the suggestions.
Thanks @crepererum and @gstvg
I merged this PR up from main to resolve a merge conflict as I want to prepare a release candidate
🚀
Which issue does this PR close?
N/A
Rationale for this change
Filtering structs and sparse unions amounts to filtering their child arrays. If any concrete child has a specialized filter, it is currently bypassed and MutableArrayData is used instead. With this change, those specializations are used.
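A minimal sketch of that idea in plain std Rust follows. Column, filter_values, filter_column and filter_struct are hypothetical stand-ins rather than arrow types or functions; the point is only that a struct is filtered by applying the same mask to each child, so each child can go through whatever filter is best for its own type. For a sparse union, the type-id buffer would be filtered with the same mask as well.

```rust
/// Hypothetical column type standing in for arrow child arrays.
#[derive(Debug, PartialEq)]
enum Column {
    Int(Vec<i64>),
    Text(Vec<String>),
}

/// Keep the values whose mask entry is `true`.
fn filter_values<T: Clone>(values: &[T], mask: &[bool]) -> Vec<T> {
    values
        .iter()
        .zip(mask)
        .filter(|(_, keep)| **keep)
        .map(|(v, _)| v.clone())
        .collect()
}

/// Dispatch per concrete child type; a real implementation could plug a
/// type-specific fast path into each arm.
fn filter_column(column: &Column, mask: &[bool]) -> Column {
    match column {
        Column::Int(v) => Column::Int(filter_values(v, mask)),
        Column::Text(v) => Column::Text(filter_values(v, mask)),
    }
}

/// Filtering the "struct" is just filtering every child with the same mask.
fn filter_struct(children: &[Column], mask: &[bool]) -> Vec<Column> {
    children.iter().map(|c| filter_column(c, mask)).collect()
}
```

Because every child sees the same mask, any up-front analysis of that mask is shared across all children.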
What changes are included in this PR?
Structs and sparse unions filter specialization
If the filtered array is a multi-column struct or a non-fieldless union, optimize the filter predicate
Add as_union and as_union_opt to the AsArray sealed trait
Are there any user-facing changes?
No