
Evaluate Kernel under Selection / Short-Circuiting Filter Evaluation #3620

Open
sunchao opened this issue Jan 27, 2023 · 15 comments
Labels
enhancement Any new improvement worthy of an entry in the changelog

Comments

@sunchao
Member

sunchao commented Jan 27, 2023

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Currently, for common expressions such as AND or OR, we don't apply any short-circuiting, so the same columnar batch must be fully evaluated against every predicate.

For instance, consider the following example:

a = 'FOO' AND b = 42

We would evaluate each batch on both predicates, then apply a bitwise AND to the resulting BooleanArrays from both sides. Similarly for OR.

This would not be efficient in many cases, nor correct (see apache/datafusion#5093 for a bug report). A more efficient approach is perhaps to apply the second predicate only to the rows remaining after evaluating the first predicate. This could be especially effective if the first predicate has low selectivity.

Note that sometimes it would still be beneficial to evaluate the full batch to take advantage of SIMD. For a detailed analysis, please check https://dl.acm.org/doi/abs/10.1145/3465998.3466009.
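To make the idea concrete, here is a minimal sketch of short-circuiting `a = 'FOO' AND b = 42`, assuming plain slices stand in for Arrow arrays; the function name is illustrative, not an arrow-rs or DataFusion API:

```rust
// Short-circuiting AND: the second predicate is evaluated only on rows that
// survived the first. In arrow-rs the masks would be BooleanArrays.
fn short_circuit_and(a: &[&str], b: &[i64]) -> Vec<bool> {
    // Predicate 1 over the full batch (cheap, easily vectorised).
    let selection: Vec<bool> = a.iter().map(|&v| v == "FOO").collect();

    // Predicate 2 only where the selection is still true (scalar `&&`
    // short-circuits, standing in for a selection-aware kernel).
    selection
        .iter()
        .zip(b)
        .map(|(&sel, &bv)| sel && bv == 42)
        .collect()
}

fn main() {
    let a = ["FOO", "BAR", "FOO"];
    let b = [42, 42, 7];
    assert_eq!(short_circuit_and(&a, &b), vec![true, false, false]);
}
```

With a highly selective first predicate, the second predicate touches only a small fraction of the batch.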

This approach has been adopted by other popular engines such as Velox, Databricks Photon, etc.

Describe the solution you'd like

Implement the short-circuiting logic in both arrow-rs and arrow-datafusion. This could introduce a lot of API changes, since we may need to add an extra SelectivityVector parameter to the related compute kernels (e.g., is_null). We would also need to change arrow-datafusion's PhysicalExpr::evaluate to take the SelectivityVector into account. Note that something similar has already been done for CASE WHEN with the introduction of a separate evaluate_selection method; see apache/datafusion#2068 for more details.

Describe alternatives you've considered

N/A

Additional context

N/A

sunchao added the enhancement label Jan 27, 2023
@tustvold
Contributor

tustvold commented Jan 27, 2023

One thing to perhaps be aware of is that we are entirely reliant on LLVM to vectorise various kernels. I have found that any branch in the body of the loop, even something as simple as Iterator::next, causes it to fail to vectorise the loop.

This has led to a number of tricks such as PrimitiveArray::unary and MutableBuffer::collect_bool that help it vectorise correctly, by ignoring the null mask. I suspect a selection vector would run into similar challenges. Whilst I don't doubt that custom SIMD kernels could apply selection masks and null masks "for free", I have not found a good way to get LLVM to figure this out.

As such, I wonder if a first step might be to implement this solely within DataFusion, in particular using the filter kernel to eagerly apply the selection to the values when the next expression is not side-effect free, or when selectivity has fallen below some threshold. We can then go from there if this filtering starts to become a major bottleneck.
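One way to picture this strategy is the sketch below, with plain Vecs standing in for Arrow arrays; the threshold value and the predicate are purely illustrative, not DataFusion code:

```rust
// If selectivity has fallen below a threshold, evaluate the expensive
// predicate only on surviving rows; otherwise evaluate the whole batch
// branch-free and AND the masks (the vectorisation-friendly path).
const SELECTIVITY_THRESHOLD: f64 = 0.1; // illustrative cutoff

fn next_predicate(values: &[i64], selection: &[bool]) -> Vec<bool> {
    let expensive = |v: i64| v % 7 == 0; // stand-in for a costly predicate
    let survivors = selection.iter().filter(|&&s| s).count();
    let selectivity = survivors as f64 / selection.len() as f64;

    if selectivity < SELECTIVITY_THRESHOLD {
        // Sparse: touch only the selected rows.
        let mut out = vec![false; values.len()];
        for (i, &s) in selection.iter().enumerate() {
            if s {
                out[i] = expensive(values[i]);
            }
        }
        out
    } else {
        // Dense: full-batch evaluation keeps the loop vectorisable.
        values
            .iter()
            .zip(selection)
            .map(|(&v, &s)| s & expensive(v))
            .collect()
    }
}

fn main() {
    let vals = [7, 14, 3, 21];
    let sel = [true, false, true, true];
    assert_eq!(next_predicate(&vals, &sel), vec![true, false, false, true]);
}
```

A real implementation would likely use the filter kernel to compact the batch instead of scattering results back, but the branch on selectivity is the essence of the proposal.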

This would also allow us to investigate potentially less intrusive changes, depending on where the filter overheads manifest. For example, late materialization, with support for a form of this short-circuiting, was recently added to the parquet reader. We could theoretically build upon this base without needing to make more intrusive modifications.

I think it would be really cool to support this, but my experience fighting LLVM over null masks, the speed of the filter kernels, and the reality that a lot of queries end up bottlenecked on sorting or decoding make me think there may be mileage in the naive approach. I'm no expert on query engines though, so happy to defer to others 😄

@sunchao
Member Author

sunchao commented Jan 27, 2023

Thanks @tustvold !

Yes, I think it's a good idea to start with a PoC in DataFusion only. I'll try to see if we can get some good numbers with the approach using some synthetic benchmarks :)

One question: how do you detect whether a certain code change would break SIMD? Is there any convenient way of doing that?

I'll take a look at the lazy materialization on Parquet side and see how it can interact with this feature.

I think it would be really cool to support this, but my experience fighting LLVM over null masks, the speed of the filter kernels, and the reality that a lot of queries end up bottlenecked on sorting or decoding, makes me think there may be mileage in the naive approach. I'm not expert on query engines though, so happy to defer to others 😄

Agreed. My feeling is also that many queries are actually bottlenecked somewhere else, like join or aggregation. It just caught my attention while I was looking at DataFusion and arrow-rs.

@tustvold
Contributor

tustvold commented Jan 27, 2023

how do you detect whether certain code change would break SIMD

I use godbolt with mocked up functions a lot, and then confirm any changes with benchmarks. Very occasionally I use gdb to disassemble symbols, as certain things like inlining heuristics are hard to mock up.

Note that you need to override the default target, e.g. target-cpu=haswell, to get any SIMD instructions newer than those introduced with the Pentium 4.

@viirya
Member

viirya commented Jan 27, 2023

I think cargo asm can be used to inspect the resulting assembly and check whether SIMD is applied.

@alamb
Contributor

alamb commented Jan 28, 2023

For example, late materialization with support for a form of this short-circuiting, was recently added to the parquet reader. We could theoretically build upon this base, without needing to make more intrusive modifications.

I believe this refers to https://docs.rs/parquet/31.0.0/parquet/arrow/arrow_reader/struct.RowSelector.html and https://docs.rs/parquet/31.0.0/parquet/arrow/arrow_reader/struct.RowSelection.html

@jhorstmann
Contributor

This paper about Fused Table Scans describes a vectorized implementation that evaluates two predicates more efficiently. The first predicate is evaluated into indices instead of a bitmap; these indices are then used to gather data into SIMD registers for the second predicate.

The improvement probably depends on the selectivity of the first predicate and doing this optimally would require some statistics about the input arrays.
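A std-only sketch of the index-based style described above, with scalar loops standing in for the SIMD gathers the paper uses (predicates and names are illustrative):

```rust
// The first predicate emits qualifying row indices instead of a bitmap; the
// second predicate then runs only over the gathered survivors.
fn fused_two_predicates(a: &[i64], b: &[i64]) -> Vec<usize> {
    // Predicate 1: a > 10, emitted as an index list (a "selection vector").
    let survivors: Vec<usize> = (0..a.len()).filter(|&i| a[i] > 10).collect();

    // Predicate 2: b = 0, evaluated only on the gathered rows.
    survivors.into_iter().filter(|&i| b[i] == 0).collect()
}

fn main() {
    let a = [5, 20, 30, 1];
    let b = [0, 0, 1, 0];
    // Rows 1 and 2 pass predicate 1; only row 1 also passes predicate 2.
    assert_eq!(fused_two_predicates(&a, &b), vec![1]);
}
```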

@alamb
Contributor

alamb commented Apr 11, 2023

Here is a discussion in DataFusion about something similar: apache/datafusion#5944

tustvold changed the title from "Implement short-circuiting for filter evaluation" to "Evaluate Kernel under Selection / Short-Circuiting Filter Evaluation" on Jun 14, 2023
@tustvold
Contributor

Thinking about this a bit more, the intention of a selection vector is to allow a kernel to skip an expensive computation, such as a string comparison or regex evaluation, when the result is unimportant because we know it is going to be discarded. For some kernels the cost of consulting the selection vector will outweigh any savings, especially for kernels like integer comparison where it interferes with vectorisation.

Now the potentially interesting observation is the exact same principle also holds for null masks, we shouldn't spend time performing expensive evaluation on null slots. I think we currently do in some cases, but this should be easy to fix.

This then leads to the obvious question: if a false value in a selection vector indicates that the result doesn't matter, how would the semantics of an operation under a selection vector differ from the semantics of an operation on arrays first passed to nullif with the selection vector? As the result is irrelevant, why would its null-ness matter?
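The question can be made concrete with a std-only model, where `Option<i64>` plays the role of a nullable Arrow slot and `mask_by_selection` (a hypothetical name) mimics what nullif with the negated selection would do:

```rust
// Null out every slot the selection discards; a subsequent null-aware kernel
// then behaves as if it ran "under selection", because null slots are skipped.
fn mask_by_selection(values: &[Option<i64>], selection: &[bool]) -> Vec<Option<i64>> {
    values
        .iter()
        .zip(selection)
        .map(|(&v, &sel)| if sel { v } else { None })
        .collect()
}

// A null-aware kernel: null in, null out.
fn negate(values: &[Option<i64>]) -> Vec<Option<i64>> {
    values.iter().map(|v| v.map(|x| -x)).collect()
}

fn main() {
    let values = [Some(1), Some(2), None];
    let selection = [true, false, true];
    let masked = negate(&mask_by_selection(&values, &selection));
    let full = negate(&values);
    // The two agree on every selected slot; unselected slots are irrelevant.
    for (i, &sel) in selection.iter().enumerate() {
        if sel {
            assert_eq!(masked[i], full[i]);
        }
    }
}
```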

@alamb
Contributor

alamb commented Jun 14, 2023

the exact same principle also holds for null masks

I think they are "almost always" the same. For example, the (non-null) values of or_kleen depend on the null values of its inputs.

However, it is an excellent point that for many kernels, the implementation is probably exactly the same after "AND"ing together the validity mask and selection vector.
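The point about or_kleen can be seen in a minimal truth-table sketch, where `None` models a null slot; this mirrors Kleene logic in std-only Rust, not the arrow-rs kernel itself:

```rust
// Kleene OR over nullable booleans: null OR true = true, null OR false = null.
// The non-null *output values* therefore depend on the inputs' null masks,
// which is where a null mask carries more meaning than a pure selection vector.
fn or_kleene(a: Option<bool>, b: Option<bool>) -> Option<bool> {
    match (a, b) {
        (Some(true), _) | (_, Some(true)) => Some(true),
        (Some(false), Some(false)) => Some(false),
        _ => None, // at least one side is null and neither is true
    }
}

fn main() {
    assert_eq!(or_kleene(None, Some(true)), Some(true)); // null input, valid output
    assert_eq!(or_kleene(None, Some(false)), None);
    assert_eq!(or_kleene(Some(false), Some(false)), Some(false));
}
```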

@alamb
Contributor

alamb commented Jun 14, 2023

Related discussion: #4393 (comment)

@tustvold
Contributor

For example the (non null) values of or_kleen's depend on the null values of its input

A value not being in the selection mask implies it is going to be discarded regardless of the output of the current kernel. I'm therefore not sure why it would ever matter? What am I missing here?

@alamb
Contributor

alamb commented Jun 14, 2023

A value not being in the selection mask implies it is going to be discarded regardless of the output of the current kernel. I'm therefore not sure why it would ever matter? What am I missing here?

Correct -- I thought you were asking if there were semantic differences between a selection vector and a null mask and so I was providing an example where there would be. I probably misunderstood this:

This then leads to the obvious question, if a false value in a selection vector indicates that the result doesn't matter, how would the semantics of an operation under a selection vector differ from the semantics of an operation with the arrays first passed to nullif with the selection vector. As the result is irrelevant, why would its null-ness matter?

@alamb
Contributor

alamb commented Jun 14, 2023

What about something like this (where you conditionally generate an error if the row is included in the computation)?

CASE 
  WHEN x IS NOT NULL
  THEN x
  ELSE x / 0 -- <--- should never be hit / error
END

@tustvold
Contributor

tustvold commented Jun 14, 2023

where you conditionally generate an error if the row is included in the computation

I think you would need something that has side effects for nulls in order to cause an issue for the approach of encoding the selection vector as a null mask. I'm struggling to think of a kernel within arrow-rs where this would be the case... All the non-side-effect-free kernels should only consider valid slots; if they don't, that is a bug, as the value of a null slot can be arbitrary, including any problematic values.
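A std-only sketch of why a fallible kernel must consult validity before touching values, using a hypothetical checked division (null slots may hold arbitrary garbage, including a zero divisor):

```rust
// A fallible kernel must never inspect the value behind a null slot:
// that value is arbitrary and could be a zero divisor. Plain slices
// stand in for Arrow buffers and the validity bitmap.
fn checked_div(
    num: &[i64],
    den: &[i64],
    valid: &[bool], // validity mask: false means the slot is null
) -> Result<Vec<Option<i64>>, String> {
    num.iter()
        .zip(den)
        .zip(valid)
        .map(|((&n, &d), &v)| {
            if !v {
                Ok(None) // null in, null out; the garbage value is never touched
            } else if d == 0 {
                Err("division by zero".to_string())
            } else {
                Ok(Some(n / d))
            }
        })
        .collect()
}

fn main() {
    // The null slot holds a zero divisor, yet no error is raised.
    assert_eq!(
        checked_div(&[10, 1], &[2, 0], &[true, false]),
        Ok(vec![Some(5), None])
    );
    assert!(checked_div(&[1], &[0], &[true]).is_err());
}
```

Encoding the selection vector into the null mask is safe exactly when every kernel follows this discipline of skipping invalid slots.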

@mbutrovich
Copy link

Now the potentially interesting observation is the exact same principle also holds for null masks, we shouldn't spend time performing expensive evaluation on null slots. I think we currently do in some cases, but this should be easy to fix.

This was a prescient comment. Here's also a recent (DaMoN 2024) evaluation of NULL representations:

"NULLS! Revisiting Null Representation in Modern Columnar Formats"
https://db.cs.cmu.edu/papers/2024/zeng-damon24.pdf
