Evaluate use of selection vectors in scan-filter-join operations #745

Open
Tracked by #717
andygrove opened this issue Jul 31, 2024 · 2 comments
Labels
enhancement New feature or request performance

Comments

@andygrove
Member

What is the problem the feature request solves?

It is very common to have scan -> filter as inputs to a join. The copying of data in the filter can be expensive when the batch contains strings and complex types, and the result of the filter is discarded after the join.

I believe that it would be more efficient to have the join use a selection vector to read inputs from the scanned batch rather than perform a filter.
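To make the difference concrete, here is a minimal sketch in plain Rust (standard library only, not the Arrow or DataFusion APIs). The function names `filter_copy`, `selection_vector`, and `selected` are hypothetical and used only for illustration. It contrasts a filter that copies every surviving string with a selection vector that only records the surviving row indices.

```rust
/// Materializing filter: clones every surviving string into a new column.
/// For string and complex types this copy is the expensive part.
fn filter_copy(column: &[String], mask: &[bool]) -> Vec<String> {
    column
        .iter()
        .zip(mask)
        .filter(|(_, keep)| **keep)
        .map(|(value, _)| value.clone())
        .collect()
}

/// Selection vector: records only the indices of the rows that pass the
/// predicate. No column data is copied.
fn selection_vector(mask: &[bool]) -> Vec<u32> {
    mask.iter()
        .enumerate()
        .filter(|(_, keep)| **keep)
        .map(|(i, _)| i as u32)
        .collect()
}

/// Reads the selected rows straight out of the original (scanned) column.
/// A downstream operator such as a join could consume this view directly.
fn selected<'a>(
    column: &'a [String],
    selection: &'a [u32],
) -> impl Iterator<Item = &'a str> + 'a {
    selection.iter().map(move |&i| column[i as usize].as_str())
}
```

Both approaches expose the same logical rows to the consumer; the selection vector costs one `u32` per surviving row instead of a copy of each value, at the price of an extra indirection on every read.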

This issue tracks the work to create a small prototype that demonstrates this. If successful, we can then discuss changes in upstream DataFusion to add support for a new ColumnarValue::ArrayWithSelectionVector, and then add a specialization in SortMergeJoin to take advantage of it.
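As a rough idea of what such a prototype could look like, the sketch below uses a simplified stand-in enum over `i64` keys. It is not DataFusion's actual `ColumnarValue`; the variant layout, the `physical_row` and `value` helpers, and the `merge_join` function are all assumptions made for illustration. The merge join reads its sorted inputs through the selection vector and emits physical row indices into the original batches, so no filtered copy is ever built.

```rust
/// Hypothetical, simplified stand-in for the proposed variant.
/// Not the real DataFusion type.
enum ColumnarValue {
    Array(Vec<i64>),
    ArrayWithSelectionVector { values: Vec<i64>, selection: Vec<u32> },
}

impl ColumnarValue {
    /// Number of logical (selected) rows.
    fn len(&self) -> usize {
        match self {
            ColumnarValue::Array(values) => values.len(),
            ColumnarValue::ArrayWithSelectionVector { selection, .. } => selection.len(),
        }
    }

    /// Maps a logical row to its position in the underlying array.
    fn physical_row(&self, row: usize) -> usize {
        match self {
            ColumnarValue::Array(_) => row,
            ColumnarValue::ArrayWithSelectionVector { selection, .. } => selection[row] as usize,
        }
    }

    /// Reads the key at a logical row, going through the selection if present.
    fn value(&self, row: usize) -> i64 {
        let physical = self.physical_row(row);
        match self {
            ColumnarValue::Array(values) => values[physical],
            ColumnarValue::ArrayWithSelectionVector { values, .. } => values[physical],
        }
    }
}

/// Inner merge join over keys that are sorted in logical order.
/// Returns (left, right) pairs of physical row indices.
fn merge_join(left: &ColumnarValue, right: &ColumnarValue) -> Vec<(usize, usize)> {
    let mut out = Vec::new();
    let (mut l, mut r) = (0, 0);
    while l < left.len() && r < right.len() {
        let (lk, rk) = (left.value(l), right.value(r));
        if lk < rk {
            l += 1;
        } else if lk > rk {
            r += 1;
        } else {
            // Emit every right row that shares this key, then advance left;
            // `r` stays at the start of the run for duplicate left keys.
            let mut rr = r;
            while rr < right.len() && right.value(rr) == lk {
                out.push((left.physical_row(l), right.physical_row(rr)));
                rr += 1;
            }
            l += 1;
        }
    }
    out
}
```

Because a filter preserves row order, a selection vector over a sorted input still yields sorted keys, which is what lets the merge loop stay unchanged apart from the indirection.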

Describe the potential solution

No response

Additional context

No response

@andygrove andygrove added enhancement New feature or request performance labels Jul 31, 2024
@andygrove andygrove added this to the 0.2.0 milestone Jul 31, 2024
@viirya
Member

viirya commented Jul 31, 2024

Related issue at arrow-rs: apache/arrow-rs#3620

@andygrove
Member Author

This paper may have useful information:

"Filter Representation in Vectorized Query Execution"
https://db.cs.cmu.edu/papers/2021/ngom-damon2021.pdf

@andygrove andygrove removed this from the 0.2.0 milestone Aug 16, 2024