Evaluate Kernel under Selection / Short-Circuiting Filter Evaluation #3620
One thing to perhaps be aware of is that we are entirely reliant on LLVM to vectorise various kernels. I have found that any branch in the body of the loop, even something as simple as […], is enough to defeat this. This has led to a number of tricks such as […]. As such I wonder if a first step might be to implement this solely within DataFusion, in particular using the […]. This would also allow us to potentially investigate less intrusive changes, depending on where the filter overheads manifest. For example, late materialization, with support for a form of this short-circuiting, was recently added to the parquet reader. We could theoretically build upon this base without needing to make more intrusive modifications. I think it would be really cool to support this, but my experience fighting LLVM over null masks, the speed of the filter kernels, and the reality that a lot of queries end up bottlenecked on sorting or decoding, makes me think there may be mileage in the naive approach. I'm no expert on query engines though, so happy to defer to others 😄
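The vectorisation concern above can be illustrated with a hedged, standalone sketch (the function names here are invented for illustration, not arrow-rs kernels). A data-dependent branch in a hot loop typically prevents LLVM from autovectorising it, while the equivalent branch-free arithmetic form usually vectorises cleanly:

```rust
// Branchy version: the `if` inside the loop is a data-dependent branch,
// which typically defeats LLVM's autovectoriser.
fn count_matches_branchy(values: &[i32], threshold: i32) -> usize {
    let mut count = 0;
    for &v in values {
        if v > threshold {
            count += 1;
        }
    }
    count
}

// Branch-free version: `(v > threshold) as usize` lowers to a comparison
// plus widen with no branch, so the loop usually vectorises.
fn count_matches_branch_free(values: &[i32], threshold: i32) -> usize {
    values.iter().map(|&v| (v > threshold) as usize).sum()
}

fn main() {
    let data: Vec<i32> = (0..1000).collect();
    // Both forms compute the same result; only the generated code differs.
    assert_eq!(count_matches_branchy(&data, 499), 500);
    assert_eq!(count_matches_branch_free(&data, 499), 500);
}
```

Whether a given rewrite actually vectorises still needs to be confirmed by inspecting the generated assembly or benchmarking, as discussed below.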
Thanks @tustvold ! Yes, I think it's a good idea to start with a PoC in DataFusion only. I'll try to see if we can get some good numbers with the approach using some synthetic benchmarks :) One question: how do you detect whether a certain code change would break SIMD? Is there any convenient way of doing that? I'll take a look at the lazy materialization on the Parquet side and see how it can interact with this feature.
Agree. My feeling is also that many queries are actually bottlenecked somewhere else, like join or aggregation. It just caught my attention while I was looking at DataFusion and […]
I use godbolt with mocked-up functions a lot, and then confirm any changes with benchmarks. Very occasionally I use gdb to disassemble symbols, as certain things like inlining heuristics are hard to mock up. Note that you need to override the default target, e.g. […]
I think […]
I believe this refers to https://docs.rs/parquet/31.0.0/parquet/arrow/arrow_reader/struct.RowSelector.html and https://docs.rs/parquet/31.0.0/parquet/arrow/arrow_reader/struct.RowSelection.html
This paper about Fused Table Scans describes a vectorized implementation of evaluating two predicates more efficiently. The first predicate is evaluated into indices instead of a bitmap; these indices are then used to gather data into SIMD registers for the second predicate. The improvement probably depends on the selectivity of the first predicate, and doing this optimally would require some statistics about the input arrays.
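A scalar sketch of the scheme described above (the SIMD gather is elided; names and predicates are hypothetical): the first predicate produces a list of passing row indices rather than a bitmap, and the second predicate only touches rows reached through those indices.

```rust
// Sketch of an index-based two-predicate scan: predicate 1 emits indices,
// predicate 2 gathers through them. A SIMD implementation would replace the
// second loop with gather instructions over the surviving indices.
fn fused_two_predicate_scan(a: &[i64], b: &[i64]) -> Vec<usize> {
    // Predicate 1 (a > 10) evaluated into indices instead of a bitmap.
    let candidates: Vec<usize> = (0..a.len()).filter(|&i| a[i] > 10).collect();
    // Predicate 2 (b < 5) is only evaluated on the surviving rows; with a
    // selective first predicate this touches far fewer elements of `b`.
    candidates.into_iter().filter(|&i| b[i] < 5).collect()
}

fn main() {
    let a = vec![20, 5, 30, 40];
    let b = vec![1, 1, 9, 2];
    // Rows 0, 2, 3 pass predicate 1; of those, rows 0 and 3 pass predicate 2.
    assert_eq!(fused_two_predicate_scan(&a, &b), vec![0, 3]);
}
```

As the comment notes, whether this beats evaluating both predicates over the full batch depends on the selectivity of the first predicate.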
Here is a discussion in DataFusion about something similar: apache/datafusion#5944
Thinking about this a bit more, the intention of a selection vector is to allow a kernel to skip an expensive computation, such as a string comparison or regex evaluation, when the result is unimportant because we know it is going to be discarded. For some kernels the cost of consulting the selection vector will outweigh any savings, especially for kernels like integer comparison where it interferes with vectorisation. Now the potentially interesting observation is that the exact same principle also holds for null masks: we shouldn't spend time performing expensive evaluation on null slots. I think we currently do in some cases, but this should be easy to fix. This then leads to the obvious question: if a false value in a selection vector indicates that the result doesn't matter, how would the semantics of an operation under a selection vector differ from the semantics of an operation with the arrays first passed to […]?
I think they are "almost always" the same. For example, the (non null) values of […] However, it is an excellent point that for many kernels, the implementation is probably exactly the same after "AND"ing together the validity mask and selection vector.
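That equivalence can be sketched with a minimal standalone example (types and names here are invented stand-ins, not arrow-rs APIs): a kernel that only does real work on slots that are both valid and selected behaves identically whether it consults two masks or a single pre-ANDed one.

```rust
// Stand-in for an expensive predicate (regex, string comparison, ...).
fn expensive_eval(v: i64) -> bool {
    v % 7 == 0
}

/// `validity[i] == false` means NULL; `selection[i] == false` means the row
/// will be discarded downstream. Only slots that are both valid and selected
/// are actually computed; all other slots yield None. Checking
/// `valid && selected` per slot is equivalent to AND-ing the two masks first.
fn eval_under_selection(
    values: &[i64],
    validity: &[bool],
    selection: &[bool],
) -> Vec<Option<bool>> {
    values
        .iter()
        .zip(validity.iter().zip(selection))
        .map(|(&v, (&valid, &selected))| {
            if valid && selected {
                Some(expensive_eval(v))
            } else {
                None
            }
        })
        .collect()
}

fn main() {
    let values = [14, 7, 3, 21];
    let validity = [true, false, true, true];
    let selection = [true, true, false, true];
    assert_eq!(
        eval_under_selection(&values, &validity, &selection),
        vec![Some(true), None, None, Some(true)]
    );
}
```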
Related discussion: #4393 (comment) |
A value not being in the selection mask implies it is going to be discarded regardless of the output of the current kernel. I'm therefore not sure why it would ever matter? What am I missing here?
Correct -- I thought you were asking if there were semantic differences between a selection vector and a null mask, and so I was providing an example where there would be. I probably misunderstood this: […]
What about something like this (where you conditionally generate an error if the row is included in the computation)?

```sql
CASE
  WHEN x IS NOT NULL
  THEN x
  ELSE x / 0 -- <--- should never be hit / error
END
```
I think you would need something that would side-effect for nulls in order to cause an issue for the approach of encoding the selection vector as a null mask. I'm struggling to think of a kernel within arrow-rs where this would be the case... All the kernels that aren't side-effect free should only consider valid slots; if they don't, that is a bug, as the value of a null slot can be arbitrary, including any problematic values.
This was a prescient comment. Here's also a recent (DaMoN 2024) evaluation of NULL representations: "NULLS! Revisiting Null Representation in Modern Columnar Formats" |
Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Currently for common expressions such as `AND` or `OR`, we don't apply any short-circuiting, and therefore the same columnar batch needs to be fully evaluated on every predicate. For instance, consider the following example: […] We would evaluate each batch on both predicates, and apply bit-and on the resulting `BooleanArray` from both sides. Similarly for `OR`. This would not be efficient in many cases, nor correct (see apache/datafusion#5093 for a bug report). A more efficient approach is perhaps to only apply the second predicate on the remaining rows from the evaluation of the first predicate. This could be especially effective if the first predicate has low selectivity.

Note, sometimes it would still be beneficial to evaluate the full batch to take advantage of SIMD. For a detailed analysis, please check https://dl.acm.org/doi/abs/10.1145/3465998.3466009.

This approach has been adopted by other popular engines such as Velox, Databricks Photon, etc.
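A minimal sketch of the proposed short-circuiting for `AND` (all names here are hypothetical, not DataFusion's API): evaluate the first predicate over the whole batch, then evaluate the second predicate only on the rows that survived.

```rust
// Short-circuiting AND over a batch of `num_rows` rows. `p1` and `p2` are
// per-row predicates; `p2` is only invoked on rows where `p1` returned true.
fn short_circuit_and<P1, P2>(num_rows: usize, p1: P1, p2: P2) -> Vec<bool>
where
    P1: Fn(usize) -> bool,
    P2: Fn(usize) -> bool,
{
    let mut result = vec![false; num_rows];
    // First predicate over the full batch; a tight loop like this can still
    // vectorise well.
    let survivors: Vec<usize> = (0..num_rows).filter(|&i| p1(i)).collect();
    // Second predicate only on surviving rows. If p1 is selective this skips
    // most of the batch, at the cost of a gather-style access pattern.
    for i in survivors {
        result[i] = p2(i);
    }
    result
}

fn main() {
    let a = [1, 50, 60, 2];
    let b = [9, 3, 8, 1];
    // a[i] > 10 AND b[i] < 5: p2 is never evaluated for rows 0 and 3.
    let out = short_circuit_and(4, |i| a[i] > 10, |i| b[i] < 5);
    assert_eq!(out, vec![false, true, false, false]);
}
```

As the paper cited above discusses, this only pays off when the first predicate is selective enough; otherwise the full-batch SIMD evaluation is cheaper.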
Describe the solution you'd like

Implement the short-circuiting logic in both `arrow-rs` and `arrow-datafusion`. This could introduce a lot of API changes, since we may need to introduce an extra parameter `SelectivityVector` for related compute kernels (e.g., `is_null`). We also need to change `arrow-datafusion`'s `PhysicalExpr::evaluate` to take the `SelectivityVector` into account. Note a similar feature has been done for `CASE WHEN` with the introduction of a separate `evaluate_selection` method. See apache/datafusion#2068 for more details.

Describe alternatives you've considered

N/A

Additional context

N/A
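To make the proposed API change concrete, here is a hedged sketch of what a selection-aware kernel signature could look like. `SelectivityVector` is modelled as a plain `Vec<bool>` and `gt_scalar_selected` is an invented name; neither exists in arrow-rs today:

```rust
// Hypothetical selection vector: true = row survives, false = row will be
// discarded downstream, so its output value is irrelevant.
type SelectivityVector = Vec<bool>;

// A comparison kernel with an optional selection parameter. With no
// selection it stays a tight, vectorisable loop; with a selection it can
// skip (or here, zero out) unselected slots.
fn gt_scalar_selected(
    values: &[i64],
    rhs: i64,
    selection: Option<&SelectivityVector>,
) -> Vec<bool> {
    match selection {
        None => values.iter().map(|&v| v > rhs).collect(),
        Some(sel) => values
            .iter()
            .zip(sel.iter())
            .map(|(&v, &s)| s && v > rhs)
            .collect(),
    }
}

fn main() {
    let values = [5, 15, 25, 1];
    assert_eq!(
        gt_scalar_selected(&values, 10, None),
        vec![false, true, true, false]
    );
    let sel = vec![true, false, true, false];
    // Unselected slots are forced to false regardless of the comparison.
    assert_eq!(
        gt_scalar_selected(&values, 10, Some(&sel)),
        vec![false, false, true, false]
    );
}
```

For a cheap predicate like this, consulting the selection per slot may well cost more than it saves, which is the trade-off raised earlier in the thread; the savings are expected for expensive kernels such as regex matching.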