Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move coalesce to datafusion-functions and remove BuiltInScalarFunction #10098

Merged
merged 4 commits into from
Apr 22, 2024

Conversation

Omega359
Copy link
Contributor

@Omega359 Omega359 commented Apr 16, 2024

Which issue does this PR close?

Closes #10090
Closes #9285
Closes #8045

Rationale for this change

Complete builtin scalar function migration

What changes are included in this PR?

Source, tests.

Are these changes tested?

Yes, all tests and benchmarks pass. I would like reviewers to please go over this in detail as a lot was changed and removed in this PR. I want to especially note the changes in the proto module for which I am uncertain as to whether backwards compatibility is a requirement (and if so, it definitely will break that).

Are there any user-facing changes?

BuiltinScalarFunction is no more. I'm pretty sure that will be a breaking change for many users that extend DF.

@github-actions github-actions bot added sql SQL Planner logical-expr Logical plan and expressions physical-expr Physical Expressions optimizer Optimizer rules core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) substrait labels Apr 16, 2024
@@ -1276,7 +1260,7 @@ impl Expr {
pub fn short_circuits(&self) -> bool {
match self {
Expr::ScalarFunction(ScalarFunction { func_def, .. }) => {
matches!(func_def, ScalarFunctionDefinition::BuiltIn(fun) if *fun == BuiltinScalarFunction::Coalesce)
matches!(func_def, ScalarFunctionDefinition::UDF(fun) if fun.name().eq("coalesce"))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't find a good way to directly refer to the coalesce function here so went with checking by name. There is another by-name check in /physical-expr/src/scalar_function.rs for make_array so I believe this is acceptable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I"m thinking if its possible if user creates udf with coalesce name

Copy link
Contributor Author

@Omega359 Omega359 Apr 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about that and my thought was that if that was done it would replace the existing coalesce function and one would think it would (hopefully) have similar functionality. If there is another way to determine it here it should be a reasonable change. Just importing the crate though isn't possible as best as I can tell because that would cause a circular dependency.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding the new API is_short_circuits()(default to false, when need set to true) to ScalarUDF and ScalarUDFImpl might be a good way to do this because users may want to define their own short-circuit functions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is an option but I think if we go that route it should be it's own PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that @haohuaijin 's idea is a good one (and keeping with the spirit of "anything we can do with built in functions can be done with ScalarAPIs"). I filed #10162 to track

input_schema,
&execution_props,
)?
return Err(proto_error(format!(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the only value left for ScalarFunction in the datafusion.proto is 'unknown' this implementation seemed reasonable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could remove the ScalarFunction part of proto too

In a follow on PR we could probably remove

enum ScalarFunction {
// 0 was Abs before
// The first enum value must be zero for open enums
unknown = 0;
// 1 was Acos
// 2 was Asin
// 3 was Atan
// 4 was Ascii
// 5 was Ceil
// 6 was Cos
// 7 was Digest
// 8 was Exp
// 9 was Floor
// 10 was Ln
// 11 was Log
// 12 was Log10
// 13 was Log2
// 14 was Round
// 15 was Signum
// 16 was Sin
// 17 was Sqrt
// Tan = 18;
// 19 was Trunc
// 20 was Array
// RegexpMatch = 21;
// 22 was BitLength
// 23 was Btrim
// 24 was CharacterLength
// 25 was Chr
// 26 was Concat
// 27 was ConcatWithSeparator
// 28 was DatePart
// 29 was DateTrunc
// 30 was InitCap
// 31 was Left
// 32 was Lpad
// 33 was Lower
// 34 was Ltrim
// 35 was MD5
// 36 was NullIf
// 37 was OctetLength
// 38 was Random
// 39 was RegexpReplace
// 40 was Repeat
// 41 was Replace
// 42 was Reverse
// 43 was Right
// 44 was Rpad
// 45 was Rtrim
// 46 was SHA224
// 47 was SHA256
// 48 was SHA384
// 49 was SHA512
// 50 was SplitPart
// StartsWith = 51;
// 52 was Strpos
// 53 was Substr
// ToHex = 54;
// 55 was ToTimestamp
// 56 was ToTimestampMillis
// 57 was ToTimestampMicros
// 58 was ToTimestampSeconds
// 59 was Now
// 60 was Translate
// Trim = 61;
// Upper = 62;
Coalesce = 63;
// 64 was Power
// 65 was StructFun
// 66 was FromUnixtime
// 67 Atan2
// 68 was DateBin
// 69 was ArrowTypeof
// 70 was CurrentDate
// 71 was CurrentTime
// 72 was Uuid
// 73 was Cbrt
// 74 Acosh
// 75 was Asinh
// 76 was Atanh
// 77 was Sinh
// 78 was Cosh
// Tanh = 79
// 80 was Pi
// 81 was Degrees
// 82 was Radians
// 83 was Factorial
// 84 was Lcm
// 85 was Gcd
// 86 was ArrayAppend
// 87 was ArrayConcat
// 88 was ArrayDims
// 89 was ArrayRepeat
// 90 was ArrayLength
// 91 was ArrayNdims
// 92 was ArrayPosition
// 93 was ArrayPositions
// 94 was ArrayPrepend
// 95 was ArrayRemove
// 96 was ArrayReplace
// 97 was ArrayToString
// 98 was Cardinality
// 99 was ArrayElement
// 100 was ArraySlice
// 103 was Cot
// 104 was ArrayHas
// 105 was ArrayHasAny
// 106 was ArrayHasAll
// 107 was ArrayRemoveN
// 108 was ArrayReplaceN
// 109 was ArrayRemoveAll
// 110 was ArrayReplaceAll
// 111 was Nanvl
// 112 was Flatten
// 113 was IsNan
// 114 was Iszero
// 115 was ArrayEmpty
// 116 was ArrayPopBack
// 117 was StringToArray
// 118 was ToTimestampNanos
// 119 was ArrayIntersect
// 120 was ArrayUnion
// 121 was OverLay
// 122 is Range
// 123 is ArrayExcept
// 124 was ArrayPopFront
// 125 was Levenshtein
// 126 was SubstrIndex
// 127 was FindInSet
// 128 was ArraySort
// 129 was ArrayDistinct
// 130 was ArrayResize
// 131 was EndsWith
// 132 was InStr
// 133 was MakeDate
// 134 was ArrayReverse
// 135 is RegexpLike
// 136 was ToChar
// 137 was ToDate
// 138 was ToUnixtime
}
message ScalarFunctionNode {
ScalarFunction fun = 1;
repeated LogicalExprNode args = 2;
}

And the references to ScalarFunctionNode


let ctx = SessionContext::new();
let lit = Expr::Literal(ScalarValue::Utf8(None));
for builtin_fun in BuiltinScalarFunction::iter() {
for function in string::functions() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using all the string functions here seemed like a reasonable choice.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that I didn't remove ScalarFunction from this file as I think that is a change best left to another PR.

@Omega359 Omega359 closed this Apr 16, 2024
@Omega359 Omega359 reopened this Apr 16, 2024
@Omega359 Omega359 marked this pull request as ready for review April 16, 2024 17:09
@andygrove andygrove added the api change Changes the API exposed to users of the crate label Apr 16, 2024
@liukun4515
Copy link
Contributor

liukun4515 commented Apr 18, 2024

It's the big milestone

Thanks, @Omega359

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing work @Omega359 -- it has been quite a ride but it is amazing to see this project at the completion.

I updated the description of this PR to close a few more tickets as well as filed a few follow on cleanups I think we can do.

I'll plan to merge this PR tomorrow unless anyone else would like time to review

Thanks again to everyone who helped make this happen

/// Resolved to a `BuiltinScalarFunction`
/// There is plan to migrate `BuiltinScalarFunction` to UDF-based implementation (issue#8045)
/// This variant is planned to be removed in long term
BuiltIn(BuiltinScalarFunction),
/// Resolved to a user defined function
UDF(Arc<crate::ScalarUDF>),
/// A scalar function constructed with name. This variant can not be executed directly
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interestingly, I don't think Name is ever used anymore -- we could eventually simply remove ScalarFunctionDefinition to avoid another level of wrapping.

However, I think there has been enough API churn for a while. We can maybe clean that up as a follow on PR in a few months or somthing

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On argument against keeping it: if it's unused and you keep it around, someone will likely use it again and make the removal even harder.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tracked in #10175

@@ -1276,7 +1260,7 @@ impl Expr {
pub fn short_circuits(&self) -> bool {
match self {
Expr::ScalarFunction(ScalarFunction { func_def, .. }) => {
matches!(func_def, ScalarFunctionDefinition::BuiltIn(fun) if *fun == BuiltinScalarFunction::Coalesce)
matches!(func_def, ScalarFunctionDefinition::UDF(fun) if fun.name().eq("coalesce"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that @haohuaijin 's idea is a good one (and keeping with the spirit of "anything we can do with built in functions can be done with ScalarAPIs"). I filed #10162 to track

@@ -0,0 +1,141 @@
// Licensed to the Apache Software Foundation (ASF) under one
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would personally expect to find coalesce in the core functions module rather than the math module, but we could do that as a follow on PR (or never).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no idea why I put this in the math module. 🤷‍♂️

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tracked in #10174

(true, Err(_))
if self.supports_zero_argument && self.name != "make_array" =>
{
// MakeArray support zero argument but has the different behavior from the array with one null.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a strange special case (that might be left over). @jayzhan211 or @Weijun-H do you remember if this is still needed?

Copy link
Contributor

@jayzhan211 jayzhan211 Apr 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should even return empty vec instead of null array here. Those logic should be handled in each function case by case to avoid adding specific assumptions to all the function.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

File #10193 to fix this

@@ -412,16 +411,6 @@ impl From<&protobuf::StringifiedPlan> for StringifiedPlan {
}
}

impl From<&protobuf::ScalarFunction> for BuiltinScalarFunction {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is the correct change -- now all functions are resolved by name.

While this is a breaking API change, most other changes to the protobuf definitions are breaking changes -- see

//! # Version Compatibility
//!
//! The serialized form are not guaranteed to be compatible across
//! DataFusion versions. A plan serialized with one version of DataFusion
//! may not be able to deserialized with a different version.
//!

input_schema,
&execution_props,
)?
return Err(proto_error(format!(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could remove the ScalarFunction part of proto too

In a follow on PR we could probably remove

enum ScalarFunction {
// 0 was Abs before
// The first enum value must be zero for open enums
unknown = 0;
// 1 was Acos
// 2 was Asin
// 3 was Atan
// 4 was Ascii
// 5 was Ceil
// 6 was Cos
// 7 was Digest
// 8 was Exp
// 9 was Floor
// 10 was Ln
// 11 was Log
// 12 was Log10
// 13 was Log2
// 14 was Round
// 15 was Signum
// 16 was Sin
// 17 was Sqrt
// Tan = 18;
// 19 was Trunc
// 20 was Array
// RegexpMatch = 21;
// 22 was BitLength
// 23 was Btrim
// 24 was CharacterLength
// 25 was Chr
// 26 was Concat
// 27 was ConcatWithSeparator
// 28 was DatePart
// 29 was DateTrunc
// 30 was InitCap
// 31 was Left
// 32 was Lpad
// 33 was Lower
// 34 was Ltrim
// 35 was MD5
// 36 was NullIf
// 37 was OctetLength
// 38 was Random
// 39 was RegexpReplace
// 40 was Repeat
// 41 was Replace
// 42 was Reverse
// 43 was Right
// 44 was Rpad
// 45 was Rtrim
// 46 was SHA224
// 47 was SHA256
// 48 was SHA384
// 49 was SHA512
// 50 was SplitPart
// StartsWith = 51;
// 52 was Strpos
// 53 was Substr
// ToHex = 54;
// 55 was ToTimestamp
// 56 was ToTimestampMillis
// 57 was ToTimestampMicros
// 58 was ToTimestampSeconds
// 59 was Now
// 60 was Translate
// Trim = 61;
// Upper = 62;
Coalesce = 63;
// 64 was Power
// 65 was StructFun
// 66 was FromUnixtime
// 67 Atan2
// 68 was DateBin
// 69 was ArrowTypeof
// 70 was CurrentDate
// 71 was CurrentTime
// 72 was Uuid
// 73 was Cbrt
// 74 Acosh
// 75 was Asinh
// 76 was Atanh
// 77 was Sinh
// 78 was Cosh
// Tanh = 79
// 80 was Pi
// 81 was Degrees
// 82 was Radians
// 83 was Factorial
// 84 was Lcm
// 85 was Gcd
// 86 was ArrayAppend
// 87 was ArrayConcat
// 88 was ArrayDims
// 89 was ArrayRepeat
// 90 was ArrayLength
// 91 was ArrayNdims
// 92 was ArrayPosition
// 93 was ArrayPositions
// 94 was ArrayPrepend
// 95 was ArrayRemove
// 96 was ArrayReplace
// 97 was ArrayToString
// 98 was Cardinality
// 99 was ArrayElement
// 100 was ArraySlice
// 103 was Cot
// 104 was ArrayHas
// 105 was ArrayHasAny
// 106 was ArrayHasAll
// 107 was ArrayRemoveN
// 108 was ArrayReplaceN
// 109 was ArrayRemoveAll
// 110 was ArrayReplaceAll
// 111 was Nanvl
// 112 was Flatten
// 113 was IsNan
// 114 was Iszero
// 115 was ArrayEmpty
// 116 was ArrayPopBack
// 117 was StringToArray
// 118 was ToTimestampNanos
// 119 was ArrayIntersect
// 120 was ArrayUnion
// 121 was OverLay
// 122 is Range
// 123 is ArrayExcept
// 124 was ArrayPopFront
// 125 was Levenshtein
// 126 was SubstrIndex
// 127 was FindInSet
// 128 was ArraySort
// 129 was ArrayDistinct
// 130 was ArrayResize
// 131 was EndsWith
// 132 was InStr
// 133 was MakeDate
// 134 was ArrayReverse
// 135 is RegexpLike
// 136 was ToChar
// 137 was ToDate
// 138 was ToUnixtime
}
message ScalarFunctionNode {
ScalarFunction fun = 1;
repeated LogicalExprNode args = 2;
}

And the references to ScalarFunctionNode

Copy link
Contributor

@liukun4515 liukun4515 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@alamb
Copy link
Contributor

alamb commented Apr 22, 2024

Thanks again @Omega359 and everyone else involved in making this happen (and welcome back @liukun4515, it is great to see you in the repo again!)

@alamb alamb merged commit 77f43c5 into apache:main Apr 22, 2024
24 checks passed
@alamb
Copy link
Contributor

alamb commented Apr 22, 2024

Thanks @Omega359 for filing follow on tickets

#10173
#10174
#10175

ccciudatu pushed a commit to hstack/arrow-datafusion that referenced this pull request Apr 26, 2024
Michael-J-Ward added a commit to Michael-J-Ward/datafusion-python that referenced this pull request May 13, 2024
Upstream is continuing it's migration to UDFs.

Ref apache/datafusion#10098
Ref apache/datafusion#10372
Michael-J-Ward added a commit to Michael-J-Ward/datafusion-python that referenced this pull request May 13, 2024
These relied on upstream BuiltinScalarFunction, which are now removed.

Ref apache/datafusion#10098
andygrove pushed a commit to apache/datafusion-python that referenced this pull request May 14, 2024
* chore: upgrade datafusion Deps

Ref #690

* update concat and concat_ws to use datafusion_functions

Moved in apache/datafusion#10089

* feat: upgrade functions.rs

Upstream is continuing it's migration to UDFs.

Ref apache/datafusion#10098
Ref apache/datafusion#10372

* fix ScalarUDF import

* feat: remove deprecated suppors_filter_pushdown and impl supports_filters_pushdown

Deprecated function removed in apache/datafusion#9923

* use `unnest_columns_with_options` instead of deprecated `unnest_column_with_option`

* remove ScalarFunction wrappers

These relied on upstream BuiltinScalarFunction, which are now removed.

Ref apache/datafusion#10098

* update dataframe `test_describe`

`null_count` was fixed upstream.

Ref apache/datafusion#10260

* remove PyDFField and related methods

DFField was removed upstream.

Ref: apache/datafusion#9595

* bump `datafusion-python` package version to 38.0.0

* re-implement `PyExpr::column_name`

The previous implementation relied on `DFField` which was removed upstream.

Ref: apache/datafusion#9595
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api change Changes the API exposed to users of the crate core Core DataFusion crate logical-expr Logical plan and expressions optimizer Optimizer rules physical-expr Physical Expressions sql SQL Planner sqllogictest SQL Logic Tests (.slt) substrait
Projects
None yet
8 participants