Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Specialize filter for structs and sparse unions #6304

Merged
merged 5 commits into from
Aug 31, 2024

Conversation

gstvg
Copy link
Contributor

@gstvg gstvg commented Aug 24, 2024

Which issue does this PR close?

N/A

Rationale for this change

Filtering structs and sparse unions is just filtering children arrays:
If any concrete children has a specialized filter, currently this is bypassed and MutableArrayData is used instead
With this change, those specializations are used.

What changes are included in this PR?

Structs and sparse unions filter specialization
If the filtered array is a multi column struct or a non-fieldless union, optimize the filter predicate
Add as_union and as_union_opt to AsArray sealed trait

Are there any user-facing changes?

No

@gstvg gstvg changed the title specialize filter for structs and sparse unions Specialize filter for structs and sparse unions Aug 24, 2024
@github-actions github-actions bot added the arrow Changes to the arrow crate label Aug 24, 2024
@@ -1871,4 +1937,75 @@ mod tests {
}
}
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Union tests already exists

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I double checked and they are immediate above this --

    fn test_filter_union_array_sparse() {
...

👍

let predicate = FilterBuilder::new(predicate).build();
let mut filter_builder = FilterBuilder::new(predicate);

fn multiple_arrays(data_type: &DataType) -> bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you move this method to the top-level? I know that Rust allows that but I think this feature confuses people more than it helps, esp. since this does NOT create a closure, i.e. variable capture is not possible.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree a top level function would be better

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

}

if multiple_arrays(values.data_type()) {
filter_builder = filter_builder.optimize();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what happens if you call optimize if the data type does NOT have multiple arrays? I'm wondering if we really need this branch or if we could optimize unconditionally. Maybe if "optimize" doesn't really "optimize" in all cases, we should fix that?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the rationale is that "optimizing" the filter requires looking at the BooleanArray which itself requires non trivial time, so for certain operations the overhead of figuring out a better filter strategy takes more time than actually running it

This is basically the same algorithm used in filter_record_batch: https://docs.rs/arrow-select/52.2.0/src/arrow_select/filter.rs.html#179-183

Users who want to always optimize can use a FilterBuilder and explicitly call optimize https://docs.rs/arrow/latest/arrow/compute/struct.FilterBuilder.html#method.optimize

I made a PR to try and clarify this in the docs #6317

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @gstvg and @crepererum -- I agree with @crepererum's comments but otherwise this PR looks good to me.

@@ -815,6 +815,14 @@ pub trait AsArray: private::Sealed {
self.as_struct_opt().expect("struct array")
}

/// Downcast this to a [`UnionArray`] returning `None` if not possible
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

let predicate = FilterBuilder::new(predicate).build();
let mut filter_builder = FilterBuilder::new(predicate);

fn multiple_arrays(data_type: &DataType) -> bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree a top level function would be better

}

if multiple_arrays(values.data_type()) {
filter_builder = filter_builder.optimize();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the rationale is that "optimizing" the filter requires looking at the BooleanArray which itself requires non trivial time, so for certain operations the overhead of figuring out a better filter strategy takes more time than actually running it

This is basically the same algorithm used in filter_record_batch: https://docs.rs/arrow-select/52.2.0/src/arrow_select/filter.rs.html#179-183

Users who want to always optimize can use a FilterBuilder and explicitly call optimize https://docs.rs/arrow/latest/arrow/compute/struct.FilterBuilder.html#method.optimize

I made a PR to try and clarify this in the docs #6317

@@ -1871,4 +1937,75 @@ mod tests {
}
}
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I double checked and they are immediate above this --

    fn test_filter_union_array_sparse() {
...

👍

@gstvg
Copy link
Contributor Author

gstvg commented Aug 29, 2024

Thanks @crepererum and @alamb for the reviews, I applied the suggestions

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @crepererum and @gstvg

I merged this PR up from main to resolve a merge conflict as I want to prepare a release candidate

@alamb alamb merged commit 831a080 into apache:master Aug 31, 2024
25 checks passed
@alamb
Copy link
Contributor

alamb commented Aug 31, 2024

🚀

@V0ldek V0ldek mentioned this pull request Sep 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants