-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ScalarUDF: Remove supports_zero_argument
and avoid creating null array for empty args
#10193
ScalarUDF: Remove supports_zero_argument
and avoid creating null array for empty args
#10193
Conversation
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking very nice
let len = if args.is_empty() { | ||
1 | ||
} else { | ||
return exec_err!("Expect random function to take no param"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might be more consistent with the other changes in this PR:
return exec_err!("Expect random function to take no param"); | |
return exec_err!("Expect {} function to take no parameters", self.name()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did this before, and I noticed that random
is used in test so I need to move it out as an independent function, but adding name
as another argument looks so weird, so I keep it like this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can even return ScalarValue::F64 instead of Array. Let me fix these too
let len = if args.is_empty() { | ||
1 | ||
} else { | ||
return internal_err!("Invalid argument type"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be good to have the test match the rest of the functions for consistency
return internal_err!("Invalid argument type"); | |
return exec_err!("Expect function to take no parameters"); |
datafusion/functions/src/math/pi.rs
Outdated
return exec_err!("Expect pi function to take no param"); | ||
} | ||
if !args.is_empty() { | ||
return exec_err!("Expect {} function to take no param", self.name()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return exec_err!("Expect {} function to take no param", self.name()); | |
return exec_err!("Expect {} function to take no parameters", self.name()); |
let len = if args.is_empty() { | ||
1 | ||
} else { | ||
return exec_err!("Expect {} function to take no param", self.name()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return exec_err!("Expect {} function to take no param", self.name()); | |
return exec_err!("Expect {} function to take no parameters", self.name()); |
supports_zero_argument
and avoid creating null array for empty args
|
||
#[test] | ||
fn test_random_expression() { | ||
let args = vec![ColumnarValue::Array(Arc::new(NullArray::new(1)))]; | ||
let array = random(&args) | ||
let array = random(&[]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove this test, since it is already covered
datafusion/datafusion/sqllogictest/test_files/expr.slt
Lines 788 to 790 in da82cec
SELECT | |
random() BETWEEN 0.0 AND 1.0, | |
random() = random() |
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
@@ -403,123 +398,6 @@ async fn test_user_defined_functions_with_alias() -> Result<()> { | |||
Ok(()) | |||
} | |||
|
|||
#[derive(Debug)] | |||
pub struct RandomUDF { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have RandomFunc
, they are the same so remove it.
} | ||
let array = Float64Array::from_value(std::f64::consts::PI, 1); | ||
Ok(ColumnarValue::Array(Arc::new(array))) | ||
fn invoke(&self, _args: &[ColumnarValue]) -> Result<ColumnarValue> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think signature check is enough, so just ignore args
} | ||
|
||
#[tokio::test] | ||
async fn deregister_udf() -> Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function is just moved , the test remains
@@ -615,6 +493,22 @@ async fn test_user_defined_functions_cast_to_i64() -> Result<()> { | |||
Ok(()) | |||
} | |||
|
|||
#[tokio::test] | |||
async fn deregister_udf() -> Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
datafusion/expr/src/udf.rs
Outdated
/// # Zero Argument Functions | ||
/// If the function has zero parameters (e.g. `now()`) it will be passed a | ||
/// single element slice which is a a null array to indicate the batch's row | ||
/// count (so the function can know the resulting array size). | ||
/// |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we please leave a note about what is required to implement Zero Argument Functions
? I think the expectation is that the output is a single ColumnarValue::Scalar
, rather than an Array
|
||
assert_eq!(floats.len(), 1); | ||
assert!(0.0 <= floats.value(0) && floats.value(0) < 1.0); | ||
fn invoke(&self, _args: &[ColumnarValue]) -> Result<ColumnarValue> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the idea here is that expectation is that rand
is invoked once per row rather than once per batch. And the only way it knew how many rows to make is to get a null array in 🤔
For example, when I run datafusion-cli
from this PR to call random() the same value is returned for each row:
> create table foo as values (1), (2), (3), (4), (5);
0 row(s) fetched.
Elapsed 0.018 seconds.
> select column1, random() from foo;
+---------+--------------------+
| column1 | random() |
+---------+--------------------+
| 1 | 0.9594375709000513 |
| 2 | 0.9594375709000513 |
| 3 | 0.9594375709000513 |
| 4 | 0.9594375709000513 |
| 5 | 0.9594375709000513 |
+---------+--------------------+
5 row(s) fetched.
Elapsed 0.012 seconds.
But I expect that each row has a different value for random()
However, since none of the tests failed, clearly we have a gap in test coverage 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch! Let me think about how to design it, I would prefer something like support_random
to specialize random()
case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about adding a invoke_no_args(num_rows: usize)
method to the ScalarUDFImpl
-- with a default implementation that returns "not implemented" error
That might make it clear what was happening and would provide clear semantics about what to do in this case 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't see the message then, it is also a good idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we test the uuid
function? i think the uuid
has the same attribute like random
cc @jayzhan211
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have test like
query II
SELECT octet_length(uuid()), length(uuid())
----
36 36
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have test like
query II SELECT octet_length(uuid()), length(uuid()) ---- 36 36
I means the uuid
function is also invoked once per row rather than once per batch
like random()
function mentioned by alamb.
I test the function in the spark
spark-sql> desc test;
col1 int NULL
Time taken: 0.065 seconds, Fetched 1 row(s)
spark-sql> select * from test;
1
2
spark-sql> select *,uuid() from test;
1 6b04b66c-2e6c-4925-8b18-a9d51d5ed80a
2 3e0be0c2-9ff2-422a-8f3f-5cdb6551264b
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
file a issue for this: #10247
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
How about adding num_rows to invoke function? Let user decide how to deal
with args and num_rows?
```rust
fn invoke(&self, args: &[ColumnarValue], num_rows: usize) -> Result<ColumnarValue>;
```
It also satisfies the need for both args and num_rows at once.
…On Wed, Apr 24, 2024, 6:27 PM Kun Liu ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In datafusion/physical-expr/src/scalar_function.rs
<#10193 (comment)>:
>
// evaluate the function
match self.fun {
- ScalarFunctionDefinition::UDF(ref fun) => fun.invoke(&inputs),
+ ScalarFunctionDefinition::UDF(ref fun) => {
+ if fun.support_randomness() {
+ fun.invoke_no_args(batch.num_rows())
Do we have any method to hide the special behavior?
—
Reply to this email directly, view it on GitHub
<#10193 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ADZCLR3GAV44GZ2J76JUXXDY66CITAVCNFSM6AAAAABGUPBNIWVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMZDAMJZGQ4TANZSGA>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jayzhan211 -- I think this now looks good. I retested with datafusion-cli
on this PR and they make different values for each row.
DataFusion CLI v37.1.0
> create table t as values (1), (2);
0 row(s) fetched.
Elapsed 0.023 seconds.
> select random(), uuid() from t;
+--------------------+--------------------------------------+
| random() | uuid() |
+--------------------+--------------------------------------+
| 0.7899418736375414 | 0e35ed7f-a059-43c1-9fe2-52afaf38c1d7 |
| 0.3470811911542333 | 42be02cc-cf5a-4736-b768-536cff69eb9c |
+--------------------+--------------------------------------+
2 row(s) fetched.
Elapsed 0.014 seconds.
I am going to add some additional tests to ensure this case is covered
Here is a PR with tests that validate we get the right answer for multiple rows: #10247 Thanks again @jayzhan211 for cleaning this up and @liukun4515 for your reviews |
Thanks @alamb and @liukun4515 |
Which issue does this PR close?
Issue from https://github.com/apache/datafusion/pull/10098/files/a422b3ab10519efea3ec671fef944a1dcf8aaf96#r1573855672
Closes #10205 .
Closes #10247
Rationale for this change
What changes are included in this PR?
supports_zero_argument
Introducesupport_randomness
invoke_no_args
Are these changes tested?
Are there any user-facing changes?