
Evaluate Kernel under Selection / Short-Circuiting Filter Evaluation #3620

Open
sunchao opened this issue Jan 27, 2023 · 15 comments
Labels
enhancement Any new improvement worthy of an entry in the changelog

Comments

@sunchao
Member

sunchao commented Jan 27, 2023

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Currently, for common expressions such as AND or OR, we don't apply any short-circuiting, so the same columnar batch must be fully evaluated against every predicate.

For instance, consider the following example:

a = 'FOO' AND b = 42

We would evaluate each batch on both predicates, then apply a bitwise AND to the resulting BooleanArrays from both sides. Similarly for OR.

This would not be efficient in many cases, nor correct (see apache/datafusion#5093 for a bug report). A more efficient approach is perhaps to apply the second predicate only to the rows remaining after evaluating the first predicate. This could be especially effective if the first predicate has low selectivity.

Note that sometimes it would still be beneficial to evaluate the full batch to take advantage of SIMD. For a detailed analysis, please check https://dl.acm.org/doi/abs/10.1145/3465998.3466009.
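To make the idea concrete, here is a minimal sketch of short-circuiting `a = 'FOO' AND b = 42`, assuming plain slices stand in for Arrow arrays; the function name is illustrative, not an arrow-rs or DataFusion API:

```rust
// Short-circuiting AND: the second predicate is evaluated only on rows that
// survived the first. In arrow-rs the masks would be BooleanArrays.
fn short_circuit_and(a: &[&str], b: &[i64]) -> Vec<bool> {
    // Predicate 1 over the full batch (cheap, easily vectorised).
    let selection: Vec<bool> = a.iter().map(|&v| v == "FOO").collect();

    // Predicate 2 only where the selection is still true (scalar `&&`
    // short-circuits, standing in for a selection-aware kernel).
    selection
        .iter()
        .zip(b)
        .map(|(&sel, &bv)| sel && bv == 42)
        .collect()
}

fn main() {
    let a = ["FOO", "BAR", "FOO"];
    let b = [42, 42, 7];
    assert_eq!(short_circuit_and(&a, &b), vec![true, false, false]);
}
```

With a highly selective first predicate, the second predicate touches only a small fraction of the batch.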

This approach has been adopted by other popular engines such as Velox, Databricks Photon, etc.

Describe the solution you'd like

Implement the short-circuiting logic in both arrow-rs and arrow-datafusion. This could introduce a lot of API changes, since we may need to add an extra SelectivityVector parameter to the related compute kernels (e.g., is_null). We would also need to change arrow-datafusion's PhysicalExpr::evaluate to take the SelectivityVector into account. Note that something similar has already been done for CASE WHEN with the introduction of a separate evaluate_selection method; see apache/datafusion#2068 for more details.

Describe alternatives you've considered

N/A

Additional context

N/A

sunchao added the enhancement label Jan 27, 2023
@tustvold
Contributor

tustvold commented Jan 27, 2023

One thing to perhaps be aware of is that we are entirely reliant on LLVM to vectorise various kernels. I have found that any branch in the body of the loop, even something as simple as Iterator::next, causes it to fail to vectorise the loop.

This has led to a number of tricks such as PrimitiveArray::unary and MutableBuffer::collect_bool that help it vectorise correctly, by ignoring the null mask. I suspect a selection vector would run into similar challenges. Whilst I don't doubt that custom SIMD kernels could apply selection masks and null masks "for free", I have not found a good way to get LLVM to figure this out.

As such, I wonder if a first step might be to implement this solely within DataFusion, in particular using the filter kernel to eagerly apply the selection to the values when the next expression is not side-effect free, or when selectivity has fallen below some threshold. We can then go from there if this filtering starts to become a major bottleneck.
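One way to picture this strategy is the sketch below, with plain Vecs standing in for Arrow arrays; the threshold value and the predicate are purely illustrative, not DataFusion code:

```rust
// If selectivity has fallen below a threshold, evaluate the expensive
// predicate only on surviving rows; otherwise evaluate the whole batch
// branch-free and AND the masks (the vectorisation-friendly path).
const SELECTIVITY_THRESHOLD: f64 = 0.1; // illustrative cutoff

fn next_predicate(values: &[i64], selection: &[bool]) -> Vec<bool> {
    let expensive = |v: i64| v % 7 == 0; // stand-in for a costly predicate
    let survivors = selection.iter().filter(|&&s| s).count();
    let selectivity = survivors as f64 / selection.len() as f64;

    if selectivity < SELECTIVITY_THRESHOLD {
        // Sparse: touch only the selected rows.
        let mut out = vec![false; values.len()];
        for (i, &s) in selection.iter().enumerate() {
            if s {
                out[i] = expensive(values[i]);
            }
        }
        out
    } else {
        // Dense: full-batch evaluation keeps the loop vectorisable.
        values
            .iter()
            .zip(selection)
            .map(|(&v, &s)| s & expensive(v))
            .collect()
    }
}

fn main() {
    let vals = [7, 14, 3, 21];
    let sel = [true, false, true, true];
    assert_eq!(next_predicate(&vals, &sel), vec![true, false, false, true]);
}
```

A real implementation would likely use the filter kernel to compact the batch instead of scattering results back, but the branch on selectivity is the essence of the proposal.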

This would also allow us to investigate potentially less intrusive changes, depending on where the filter overheads manifest. For example, late materialization, with support for a form of this short-circuiting, was recently added to the parquet reader. We could theoretically build upon this base without needing to make more intrusive modifications.

I think it would be really cool to support this, but my experience fighting LLVM over null masks, the speed of the filter kernels, and the reality that a lot of queries end up bottlenecked on sorting or decoding make me think there may be mileage in the naive approach. I'm no expert on query engines though, so happy to defer to others 😄

@sunchao
Member Author

sunchao commented Jan 27, 2023

Thanks @tustvold !

Yes, I think it's a good idea to start with a PoC in DataFusion only. I'll try to see if we can get some good numbers with the approach using some synthetic benchmarks :)

One question: how do you detect whether a certain code change would break SIMD? Is there any convenient way of doing that?

I'll take a look at the lazy materialization on Parquet side and see how it can interact with this feature.

I think it would be really cool to support this, but my experience fighting LLVM over null masks, the speed of the filter kernels, and the reality that a lot of queries end up bottlenecked on sorting or decoding, makes me think there may be mileage in the naive approach. I'm not expert on query engines though, so happy to defer to others 😄

Agreed. My feeling is also that many queries are actually bottlenecked somewhere else, like join or aggregation. It just caught my attention while I was looking at DataFusion and arrow-rs.

@tustvold
Contributor

tustvold commented Jan 27, 2023

how do you detect whether certain code change would break SIMD

I use godbolt with mocked up functions a lot, and then confirm any changes with benchmarks. Very occasionally I use gdb to disassemble symbols, as certain things like inlining heuristics are hard to mock up.

Note that you need to override the default target, e.g. target-cpu=haswell, to get any SIMD instructions newer than those introduced with the Pentium 4.

@viirya
Member

viirya commented Jan 27, 2023

I think cargo asm can be used to inspect the resulting assembly and check whether SIMD is applied.

@alamb
Contributor

alamb commented Jan 28, 2023

For example, late materialization with support for a form of this short-circuiting, was recently added to the parquet reader. We could theoretically build upon this base, without needing to make more intrusive modifications.

I believe this refers to https://docs.rs/parquet/31.0.0/parquet/arrow/arrow_reader/struct.RowSelector.html and https://docs.rs/parquet/31.0.0/parquet/arrow/arrow_reader/struct.RowSelection.html

@jhorstmann
Contributor

This paper about Fused Table Scans describes a vectorized implementation that evaluates two predicates more efficiently. The first predicate is evaluated into indices instead of a bitmap; these indices are then used to gather data into SIMD registers for the second predicate.

The improvement probably depends on the selectivity of the first predicate and doing this optimally would require some statistics about the input arrays.
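A std-only sketch of the index-based style described above, with scalar loops standing in for the SIMD gathers the paper uses (predicates and names are illustrative):

```rust
// The first predicate emits qualifying row indices instead of a bitmap; the
// second predicate then runs only over the gathered survivors.
fn fused_two_predicates(a: &[i64], b: &[i64]) -> Vec<usize> {
    // Predicate 1: a > 10, emitted as an index list (a "selection vector").
    let survivors: Vec<usize> = (0..a.len()).filter(|&i| a[i] > 10).collect();

    // Predicate 2: b = 0, evaluated only on the gathered rows.
    survivors.into_iter().filter(|&i| b[i] == 0).collect()
}

fn main() {
    let a = [5, 20, 30, 1];
    let b = [0, 0, 1, 0];
    // Rows 1 and 2 pass predicate 1; only row 1 also passes predicate 2.
    assert_eq!(fused_two_predicates(&a, &b), vec![1]);
}
```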

@alamb
Contributor

alamb commented Apr 11, 2023

Here is a discussion in DataFusion about something similar: apache/datafusion#5944

tustvold changed the title from "Implement short-circuiting for filter evaluation" to "Evaluate Kernel under Selection / Short-Circuiting Filter Evaluation" on Jun 14, 2023
@tustvold
Contributor

Thinking about this a bit more, the intention of a selection vector is to allow a kernel to skip an expensive computation, such as a string comparison or regex evaluation, when the result is unimportant because we know it is going to be discarded. For some kernels the cost of consulting the selection vector will outweigh any savings, especially for kernels like integer comparison where it interferes with vectorisation.

Now the potentially interesting observation is the exact same principle also holds for null masks, we shouldn't spend time performing expensive evaluation on null slots. I think we currently do in some cases, but this should be easy to fix.

This then leads to the obvious question: if a false value in a selection vector indicates that the result doesn't matter, how would the semantics of an operation under a selection vector differ from the semantics of an operation on arrays first passed to nullif with the selection vector? As the result is irrelevant, why would its null-ness matter?
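The question can be made concrete with a std-only model, where `Option<i64>` plays the role of a nullable Arrow slot and `mask_by_selection` (a hypothetical name) mimics what nullif with the negated selection would do:

```rust
// Null out every slot the selection discards; a subsequent null-aware kernel
// then behaves as if it ran "under selection", because null slots are skipped.
fn mask_by_selection(values: &[Option<i64>], selection: &[bool]) -> Vec<Option<i64>> {
    values
        .iter()
        .zip(selection)
        .map(|(&v, &sel)| if sel { v } else { None })
        .collect()
}

// A null-aware kernel: null in, null out.
fn negate(values: &[Option<i64>]) -> Vec<Option<i64>> {
    values.iter().map(|v| v.map(|x| -x)).collect()
}

fn main() {
    let values = [Some(1), Some(2), None];
    let selection = [true, false, true];
    let masked = negate(&mask_by_selection(&values, &selection));
    let full = negate(&values);
    // The two agree on every selected slot; unselected slots are irrelevant.
    for (i, &sel) in selection.iter().enumerate() {
        if sel {
            assert_eq!(masked[i], full[i]);
        }
    }
}
```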

@alamb
Contributor

alamb commented Jun 14, 2023

the exact same principle also holds for null masks

I think they are "almost always" the same. For example, the (non-null) values of or_kleen depend on the null values of its inputs.

However, it is an excellent point that for many kernels, the implementation is probably exactly the same after "AND"ing together the validity mask and selection vector.
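The point about or_kleen can be seen in a minimal truth-table sketch, where `None` models a null slot; this mirrors Kleene logic in std-only Rust, not the arrow-rs kernel itself:

```rust
// Kleene OR over nullable booleans: null OR true = true, null OR false = null.
// The non-null *output values* therefore depend on the inputs' null masks,
// which is where a null mask carries more meaning than a pure selection vector.
fn or_kleene(a: Option<bool>, b: Option<bool>) -> Option<bool> {
    match (a, b) {
        (Some(true), _) | (_, Some(true)) => Some(true),
        (Some(false), Some(false)) => Some(false),
        _ => None, // at least one side is null and neither is true
    }
}

fn main() {
    assert_eq!(or_kleene(None, Some(true)), Some(true)); // null input, valid output
    assert_eq!(or_kleene(None, Some(false)), None);
    assert_eq!(or_kleene(Some(false), Some(false)), Some(false));
}
```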

@alamb
Contributor

alamb commented Jun 14, 2023

Related discussion: #4393 (comment)

@tustvold
Contributor

For example the (non null) values of or_kleen's depend on the null values of its input

A value not being in the selection mask implies it is going to be discarded regardless of the output of the current kernel. I'm therefore not sure why it would ever matter? What am I missing here?

@alamb
Contributor

alamb commented Jun 14, 2023

A value not being in the selection mask implies it is going to be discarded regardless of the output of the current kernel. I'm therefore not sure why it would ever matter? What am I missing here?

Correct -- I thought you were asking if there were semantic differences between a selection vector and a null mask and so I was providing an example where there would be. I probably misunderstood this:

This then leads to the obvious question, if a false value in a selection vector indicates that the result doesn't matter, how would the semantics of an operation under a selection vector differ from the semantics of an operation with the arrays first passed to nullif with the selection vector. As the result is irrelevant, why would its null-ness matter?

@alamb
Contributor

alamb commented Jun 14, 2023

What about something like this (where you conditionally generate an error if the row is included in the computation)?

CASE 
  WHEN x IS NOT NULL
  THEN x
  ELSE x / 0 -- <--- should never be hit / error
END

@tustvold
Contributor

tustvold commented Jun 14, 2023

where you conditionally generate an error if the row is included in the computation

I think you would need something that has side effects for nulls in order to cause an issue for the approach of encoding the selection vector as a null mask. I'm struggling to think of a kernel within arrow-rs where this would be the case... All the non-side-effect-free kernels should only consider valid slots; if they don't, that is a bug, as the value of a null slot can be arbitrary, including any problematic values.
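A std-only sketch of why a fallible kernel must consult validity before touching values, using a hypothetical checked division (null slots may hold arbitrary garbage, including a zero divisor):

```rust
// A fallible kernel must never inspect the value behind a null slot:
// that value is arbitrary and could be a zero divisor. Plain slices
// stand in for Arrow buffers and the validity bitmap.
fn checked_div(
    num: &[i64],
    den: &[i64],
    valid: &[bool], // validity mask: false means the slot is null
) -> Result<Vec<Option<i64>>, String> {
    num.iter()
        .zip(den)
        .zip(valid)
        .map(|((&n, &d), &v)| {
            if !v {
                Ok(None) // null in, null out; the garbage value is never touched
            } else if d == 0 {
                Err("division by zero".to_string())
            } else {
                Ok(Some(n / d))
            }
        })
        .collect()
}

fn main() {
    // The null slot holds a zero divisor, yet no error is raised.
    assert_eq!(
        checked_div(&[10, 1], &[2, 0], &[true, false]),
        Ok(vec![Some(5), None])
    );
    assert!(checked_div(&[1], &[0], &[true]).is_err());
}
```

Encoding the selection vector into the null mask is safe exactly when every kernel follows this discipline of skipping invalid slots.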

@mbutrovich
Copy link

Now the potentially interesting observation is the exact same principle also holds for null masks, we shouldn't spend time performing expensive evaluation on null slots. I think we currently do in some cases, but this should be easy to fix.

This was a prescient comment. Here's also a recent (DaMoN 2024) evaluation of NULL representations:

"NULLS! Revisiting Null Representation in Modern Columnar Formats"
https://db.cs.cmu.edu/papers/2024/zeng-damon24.pdf
