Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a ScalarUDFImpl::simplfy() API, move SimplifyInfo et al to datafusion_expr #9304

Merged
merged 30 commits into from
Mar 5, 2024

Conversation

jayzhan211
Copy link
Contributor

Which issue does this PR close?

Closes #9289 .

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added logical-expr Logical plan and expressions physical-expr Physical Expressions core Core DataFusion crate labels Feb 21, 2024
@jayzhan211 jayzhan211 marked this pull request as ready for review February 21, 2024 15:01
@jayzhan211 jayzhan211 marked this pull request as draft February 23, 2024 03:16
@github-actions github-actions bot added the optimizer Optimizer rules label Feb 23, 2024
}
}

fn function_simplication_analyze_internal(plan: &LogicalPlan) -> Result<LogicalPlan> {
Copy link
Contributor Author

@jayzhan211 jayzhan211 Feb 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

port from operator to function, schema is removed since not used

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if there is any reason calling simplify needs to be its own analyzer pass? Can't it be called as part of the existing simplification code?

logical_plan after simplify_expressions SAME TEXT AS ABOVE

That would also allow this function to be used by anyone else that used the simplification API directly: https://github.com/apache/arrow-datafusion/blob/8f3d1ef23f93cd4303745eba76c0850b39774d07/datafusion-examples/examples/expr_api.rs#L113

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No special reason. I forgot this analyzer exists. Let me take a look

@github-actions github-actions bot removed the physical-expr Physical Expressions label Feb 23, 2024
@jayzhan211 jayzhan211 marked this pull request as ready for review February 23, 2024 12:02
@jayzhan211 jayzhan211 marked this pull request as draft February 23, 2024 12:38
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Feb 23, 2024
@jayzhan211 jayzhan211 marked this pull request as ready for review February 24, 2024 00:17
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jayzhan211 -- this is looking really nice and close. Thank you so much for working on these items and I again apologize for the delay in review.

I left some comments -- let me know what you think.

}
// Wrap with Expr::Cast() to Int64
fn simplify(&self, args: &[Expr]) -> Result<Simplified> {
let dfs = DFSchema::new_with_metadata(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great example, thank you @jayzhan211

It seems to me that the UDF can't always know what the input schema would be, and thus the schema should be passed in.

Something like this

    fn simplify(&self, args: &[Expr], schema: &DFSchema) -> Result<Simplified> {
...
     }

What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SimpliyContext has DFSchemaRef, so I change the type to DFSchemaRef, I think it is also fine

}

#[tokio::test]
async fn test_user_defined_functions_cast_to_i64() -> Result<()> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much for this test / example -- it makes seeing how the API would work really clear. 👏

Comment on lines 358 to 355
fn simplify(&self, _args: &[Expr]) -> Result<Simplified> {
Ok(Simplified::Original)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to make this take self (and avoid a copy)? Something like

Suggested change
fn simplify(&self, _args: &[Expr]) -> Result<Simplified> {
Ok(Simplified::Original)
}
fn simplify(self, _args: &[Expr]) -> Result<Simplified> {
Ok(Simplified::Original(self))
}
```?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems quite hard.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Screenshot 2024-02-28 at 08 43 25

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I +Sized for UDF trait, it can't be made into an object.

}
}

fn function_simplication_analyze_internal(plan: &LogicalPlan) -> Result<LogicalPlan> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if there is any reason calling simplify needs to be its own analyzer pass? Can't it be called as part of the existing simplification code?

logical_plan after simplify_expressions SAME TEXT AS ABOVE

That would also allow this function to be used by anyone else that used the simplification API directly: https://github.com/apache/arrow-datafusion/blob/8f3d1ef23f93cd4303745eba76c0850b39774d07/datafusion-examples/examples/expr_api.rs#L113

@jayzhan211 jayzhan211 marked this pull request as draft February 27, 2024 15:00
@github-actions github-actions bot removed the sqllogictest SQL Logic Tests (.slt) label Feb 28, 2024
@jayzhan211 jayzhan211 marked this pull request as ready for review February 28, 2024 12:11
@alamb
Copy link
Contributor

alamb commented Feb 28, 2024

Thanks @jayzhan211 -- I have some more thoughts on this PR and I hope to share them soon. However it may not be until tomorrow as I have a bunch of other things going on right now so I may run out of time

@alamb
Copy link
Contributor

alamb commented Feb 29, 2024

I ran out of time (again) today, but I will put this at the top of my list for tomorrow

// Do the function rewrite.
// 'args': The arguments of the function
// 'schema': The schema of the function
fn simplify(&self, _args: &[Expr], _schema: DFSchemaRef) -> Result<Simplified> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I played around with this PR today.

I think the core challenge is that the Simplification code uses SimplifyContext which purposely doesn't have a reference to DFSchema. This means the simplifier can't call ScalarUDFImpl::simplify as written.

I tried several options, and here is my suggestion:

  1. Move SimplifyContext into datafusion_expr
  2. Change the signature to fn simplify(&self, _args: &[Expr], info: &dyn SimplifyContext) -> Result<Simplified>

Then I think calling ScalarUDFImpl::simplify will end up being straightforward to call

The one challenge I found when I tried to do this locally is that SImplifyInfo currently depends on ExecutionPropss which is in datafusion-physical-expr -- we probably have to move that structure into DataFusion common or datafusion expr as well

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb
Copy link
Contributor

alamb commented Mar 1, 2024

I view this PR as strategically important (as it is needed to unlock the migration of functions out of the core)

I think figuring out how to get SimplifyContext out of datafusion/optimizer is the key. It is a matter of picking in which crates ExecutionProps and VarProvider go -- I can try and help tomorrow.

{
let schema = self
.info
.schema()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can get the DFSchema from SimplifyContext here.

@jayzhan211
Copy link
Contributor Author

jayzhan211 commented Mar 3, 2024

I pushed some commits to this PR @jayzhan211 -- I hope you don't mind.

Do you think the changes make sense ?

Thank you @alamb , I don't mind that.
I'm not sure why we should avoid returning schema, others make sense to me

@@ -1233,6 +1272,19 @@ impl<'a, S: SimplifyInfo> TreeNodeRewriter for Simplifier<'a, S> {
out_expr.rewrite(self)?
}

Expr::ScalarFunction(ScalarFunction {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

@alamb
Copy link
Contributor

alamb commented Mar 3, 2024

I'm not sure why we should avoid returning schema, others make sense to me

I was thinking we should stay true to the design intent on SimplifyInfo that says it was designed to avoid having to create DFSchema objects.

If we were going to require DFSChema I think we could simply get rid of SimplifyInfo entirely 🤔

/// The information necessary to apply algebraic simplification to an
/// [Expr]. See [SimplifyContext] for one concrete implementation.
///
/// This trait exists so that other systems can plug schema
/// information in without having to create `DFSchema` objects. If you
/// have a [`DFSchemaRef`] you can use [`SimplifyContext`]
pub trait SimplifyInfo {

@alamb alamb changed the title Add a ScalarUDFImpl::simplfy() API Add a ScalarUDFImpl::simplfy() API, move SimplifyInfo et al to datafusion_expr Mar 3, 2024
@jayzhan211
Copy link
Contributor Author

LGTM

@alamb
Copy link
Contributor

alamb commented Mar 4, 2024

I plan to resolve the conflicts with this PR later today and merge this PR unless there are any additional comments.

// Casting should be done in `simplify`, so we just return the first argument
fn invoke(&self, args: &[ColumnarValue]) -> Result<ColumnarValue> {
assert_eq!(args.len(), 1);
Ok(args.first().unwrap().clone())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to add ExprSimplifyResult::Replace(Expr) to cover this case, and eliminate UDF call when expression is simplified?

Copy link
Contributor

@milenkovicm milenkovicm Mar 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To put some context to my comment, let's say if we define function f(INT, INT) = $1 + $2 we can eliminate UDF call with Alias($1 + $2, "f(a,b)") and get UDF free plan, which would be easier to distribute across ballista cluster

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is an excellent idea -- I did so in fea82cb

// Casting should be done in `simplify`, so we just return the first argument
fn invoke(&self, args: &[ColumnarValue]) -> Result<ColumnarValue> {
assert_eq!(args.len(), 1);
Ok(args.first().unwrap().clone())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is an excellent idea -- I did so in fea82cb

milenkovicm added a commit to milenkovicm/arrow-datafusion-fork that referenced this pull request Mar 5, 2024
... `DROP FUNCTION` will look for function name
in all available registries (udf, udaf, udwf).

`remove` may be necessary if UDaF and UDwF do not
get `simplify` method from apache#9304.
Ok(ExprSimplifyResult::Simplified(new_expr))
}

// Casting should be done in `simplify`, so we just return the first argument
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment is a bit outdated

@alamb
Copy link
Contributor

alamb commented Mar 5, 2024

Here we go 🚀 Thank you again @jayzhan211 for all the help getting this one tee'd up and ready to merge. I think this will be a very powerful (but simple) API

@alamb alamb merged commit 2873fd0 into apache:main Mar 5, 2024
24 checks passed
alamb added a commit that referenced this pull request Mar 5, 2024
* Add plugable function factory

* cover `DROP FUNCTION` as well ...

... partially, as `SessionState` does not expose
unregister_udf at the moment.

* update documentation

* fix doc test

* Address PR comments (code organization)

* Address PR comments (factory interface)

* fix test after rebase

* `remove`'s gone from the trait ...

... `DROP FUNCTION` will look for function name
in all available registries (udf, udaf, udwf).

`remove` may be necessary if UDaF and UDwF do not
get `simplify` method from #9304.

* Rename FunctionDefinition and export it ...

FunctionDefinition already exists, DefinitionStatement makes more sense.

* Update datafusion/expr/src/logical_plan/ddl.rs

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* Update datafusion/core/src/execution/context/mod.rs

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* Update datafusion/core/tests/user_defined/user_defined_scalar_functions.rs

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* Update datafusion/expr/src/logical_plan/ddl.rs

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* resolve part of follow up comments

* Qualified functions are not supported anymore

* update docs and todos

* fix clippy

* address additional comments

* Add sqllogicteset for CREATE/DROP function

* Add coverage for DROP FUNCTION IF EXISTS

* fix multiline error

* revert dialect back to generic in test ...

... as `create function` gets support in latest
sqlparser.

* fmt

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Comment on lines +1309 to +1317
}) => match udf.simplify(args, info)? {
ExprSimplifyResult::Original(args) => {
Transformed::yes(Expr::ScalarFunction(ScalarFunction {
func_def: ScalarFunctionDefinition::UDF(udf),
args,
}))
}
ExprSimplifyResult::Simplified(expr) => Transformed::no(expr),
},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I'm a bit late to the party on this one but I keep rereading this and I can't help but think the logic is inverted - Original means unchanged which should result in Transformed::no, shouldn't it? And Simplified mean changed which should result in Transformed::yes?

I did a test where I switched the yes -> no, no -> yes and cargo test passes which makes me wonder if this code is indeed getting tested properly 🤔

Or am I completely missing something (very likely)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you are correct that the logic is inverted. I wil make a PR to correct it.

I did a test where I switched the yes -> no, no -> yes and cargo test passes which makes me wonder if this code is indeed getting tested properly 🤔

It is a good question about why no tests fail if Transformed::no

I think it is because the expr simplifer code doens't do anything different if the expression was transformed or not: https://github.com/apache/arrow-datafusion/blob/f3836a53122e86f2b73c25557deaa5b800e488e9/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs#L188-L189

It just looks at data:
https://github.com/apache/arrow-datafusion/blob/f3836a53122e86f2b73c25557deaa5b800e488e9/datafusion/common/src/tree_node.rs#L456

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate logical-expr Logical plan and expressions optimizer Optimizer rules physical-expr Physical Expressions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add a ScalarUDFImpl::simplfy() API
4 participants