Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: support for AnyExpression #2549

Closed
wants to merge 1 commit into from
Closed

Conversation

ovr
Copy link
Contributor

@ovr ovr commented May 16, 2022

Which issue does this PR close?

#2548

Rationale for this change

This PR implements partial support for ANY operator, only for = & <>.

image

Are there any user-facing changes?

This PR doesn't introduce any breaking changes, only new functionality

@github-actions github-actions bot added the datafusion Changes in the datafusion crate label May 16, 2022
@andygrove andygrove added the api change Changes the API exposed to users of the crate label May 16, 2022
@andygrove
Copy link
Member

I added the api change label since this introduces a new AnyExpr variant in the Expr enum, which is a breaking change.

@ovr
Copy link
Contributor Author

ovr commented May 16, 2022

@alamb @andygrove is it required to add a new expression for Ballista? I did that in 218a917, but I don't use it. Thanks

@andygrove
Copy link
Member

@alamb @andygrove is it required to add a new expression for Ballista? I did that in 218a917, but I don't use it. Thanks

I assume you had to add that to fix compilation issues because ballista has exhaustive matching of all expressions? The changes look reasonable to me.

Note that we also have a datafusion.proto (in fact, we have two copies, as noted in #2514) that we should ideally also add new expressions to (but we have not been doing this consistently).

@alamb
Copy link
Contributor

alamb commented May 18, 2022

I plan to review this carefully today -- thanks @ovr

@alamb alamb changed the title feat: Initial support for AnyExpression feat: support for AnyExpression May 18, 2022
Comment on lines +334 to +338
fn evaluate(&self, batch: &RecordBatch) -> Result<ColumnarValue> {
let value = match self.value.evaluate(batch)? {
ColumnarValue::Array(array) => array,
ColumnarValue::Scalar(scalar) => scalar.to_array(),
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @ovr -- this is looking quite cool.

I haven't reviewed this code super carefully but I wonder what you think about reusing more of the code for InList?

https://github.com/cube-js/arrow-datafusion/blob/binary-df-pr/datafusion/physical-expr/src/expressions/in_list.rs#L439

It seems like the implementations of x IN (<expr>, <expr>) and x ANY (<expr>) are almost exactly the same except for the fact that the the x = ANY(<expr>) has the list as a single <expr> ?

The implementation of get_set needs to be different, but otherwise the rest of the code could probably be called directly.

I mention this as it is likely the IN / NOT IN are more optimized (use a hashset for testing, for example) as well as already written.

Let me know what you think

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't reviewed this code super carefully but I wonder what you think about reusing more of the code for InList?

Probably, the nearest operator will be ALL, but it's not implemented right now. I tried to reuse the code of InList, but come to the decision that it will be easier/correct to combine it with ANY in the future.

It seems like the implementations of x IN (, ) and x ANY () are almost exactly the same except for the fact that the the x = ANY() has the list as a single ?

Nope, because it supports more operators: <, <=, >, >= (it's not implemented in this PR, but anyway).

I mention this as it is likely the IN / NOT IN are more optimized (use a hashset for testing, for example) as well as already written.

Yes, but IN supports only the vector of scalars in DF. It's not the same as ANY, because it supports List (real column) or scalar.

For example SELECT left, values FROM table:

left | right LIst(Int64) - another logic is required
1.    | [1,2,3] - ArrayRef = PrimitiveArray<Int64> - another logic is requied
2.   |. [2,3,4]
3.   |. [4,5,6]
4.   |  [4,5,6]
5.   |. [4,5,6]

In the case of using the whole column, it is required to use another handling of interacting on values ☝️ . Probably here, will be more correct to go by steps:

  • Support more operators
  • Introduce ALL operator + extract the same logic with ANY.

Thanks

@andygrove
Copy link
Member

@ovr could you merge latest from master into this branch to pick up #2567

@alamb
Copy link
Contributor

alamb commented May 20, 2022 via email

@github-actions github-actions bot added core Core DataFusion crate logical-expr Logical plan and expressions optimizer Optimizer rules physical-expr Physical Expressions sql SQL Planner labels Jun 2, 2022
@codecov-commenter
Copy link

Codecov Report

Merging #2549 (2d32cb2) into master (4b3eb1c) will decrease coverage by 0.09%.
The diff coverage is 69.00%.

@@            Coverage Diff             @@
##           master    #2549      +/-   ##
==========================================
- Coverage   84.64%   84.54%   -0.10%     
==========================================
  Files         267      268       +1     
  Lines       46926    47239     +313     
==========================================
+ Hits        39719    39937     +218     
- Misses       7207     7302      +95     
Impacted Files Coverage Δ
datafusion/core/src/datasource/listing/helpers.rs 95.34% <ø> (ø)
...ion/core/src/optimizer/common_subexpr_eliminate.rs 87.81% <0.00%> (-0.74%) ⬇️
...afusion/core/src/optimizer/simplify_expressions.rs 81.90% <0.00%> (-0.10%) ⬇️
datafusion/core/src/optimizer/utils.rs 31.77% <0.00%> (-0.49%) ⬇️
datafusion/expr/src/utils.rs 91.79% <ø> (ø)
datafusion/physical-expr/src/expressions/mod.rs 100.00% <ø> (ø)
datafusion/proto/src/from_proto.rs 33.94% <0.00%> (-0.16%) ⬇️
datafusion/proto/src/to_proto.rs 55.74% <0.00%> (-0.59%) ⬇️
datafusion/sql/src/utils.rs 50.19% <0.00%> (-0.79%) ⬇️
datafusion/expr/src/expr_schema.rs 66.88% <16.66%> (-2.04%) ⬇️
... and 11 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4b3eb1c...2d32cb2. Read the comment docs.

@ovr
Copy link
Contributor Author

ovr commented Jun 2, 2022

Rebased ✅ cC @alamb @tustvold

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'm missing something fundamental, but I'm struggling to follow why this is handling ListArrays at all... This makes it quite hard for me to review 😅

@@ -955,6 +955,30 @@ async fn test_extract_date_part() -> Result<()> {
Ok(())
}

#[tokio::test]
async fn test_binary_any() -> Result<()> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth testing what happens if NULL is in the list?

/// The comparison operator
op: Operator,
/// Right-hand side of the expression
right: Box<Expr>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm guessing supporting an arbitrary subquery here is a future extension?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Expr here could be an Expr::ScalarSubquery to cover the subquery case (assuming that we only need to support subqueries that select a single column?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The definition of AnyExpr here is identical to the definition of BinaryExpr. I wonder if it would be better to use BinaryExpr for consistency and model Any as AnyExpr(Expr) to just model the right-hand side?

ColumnarValue::Array(array) => (array, false),
ColumnarValue::Scalar(scalar) => (scalar.to_array(), true),
};
let as_list = list
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm probably missing something fundamental, but I'm confused as to why this is a ListArray, i.e. 2-dimensional.

In all the examples given the expression is [1,2,3] which is 1 dimensional?

}
};

match value.data_type() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Depending on if this needs to be using ListArray, this could possibly use the dyn comparison kernels in arrow to reduce the amount of code.

let col_a = col("a", &schema)?;

let values_builder = Int64Builder::new(3 * 3);
let mut builder = ListBuilder::new(values_builder);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using ListArray::from_iter_primitive can be easier to read than the builders.

@andygrove
Copy link
Member

I reviewed the logical plan, SQL parsing, and optimizer changes and LGTM. I did not review the execution part.

It might make reviews easier for future work like this to have separate PRs for logical versus physical parts. I would have been more comfortable reviewing and approving the logical plan parts of this.

@andygrove andygrove removed the datafusion Changes in the datafusion crate label Jun 3, 2022
@andygrove
Copy link
Member

Hi @ovr What do you think about breaking this up into smaller PRs and starting with the logical plan changes and SQL parsing? I think this might make it easier to start getting parts of this merged in.

@tustvold tustvold marked this pull request as draft June 28, 2022 15:39
@alamb
Copy link
Contributor

alamb commented Jan 14, 2023

This PR is more than 6 month old, so closing it down for now to clean up the PR list. Please reopen if this is a mistake and you plan to work on it more

@alamb alamb closed this Jan 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api change Changes the API exposed to users of the crate core Core DataFusion crate logical-expr Logical plan and expressions optimizer Optimizer rules physical-expr Physical Expressions sql SQL Planner
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants