-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: support NULL
in array functions
#6662
Conversation
@alamb can you review this PR if you have time. |
Hi @izveigor -- I will review this tomorrow. Sorry I am behind |
NULL
in array functions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this @izveigor -- I didn't go through this entire PR carefully, but I am not sure about the approach to null handling. I tried to explain how the rest of DataFusion handles this -- hopefully it is clear
Ok(List(Arc::new(Field::new("item", expr_type, true)))) | ||
} | ||
BuiltinScalarFunction::ArrayDims => Ok(UInt8), | ||
BuiltinScalarFunction::ArrayFill => Ok(Null), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why ArrayFill always returns Null
as its data type
My reading of https://www.postgresql.org/docs/9.1/functions-array.html suggests that it should return something like List(args[0].type)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function returns nested list:
postgres=# select array_fill(3, array[2, 3, 2]);
array_fill
-------------------------------------------
{{{3,3},{3,3},{3,3}},{{3,3},{3,3},{3,3}}}
(1 row)
Therefore it should return nested list. I think this is the serious problem because to return nested list with right dimensions we should know the length of the second argument (list).
P.S. after the changes: #6595 it does not work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to_type: &DataType, | ||
schema: &DFSchema, | ||
) -> Result<Expr> { | ||
if from_type.equals_datatype(&DataType::Null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would have thought we should be casting all the arguments that are null
to the specific type of the rest of the arguments...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For reference, this is the first way to solve the problem (see below).
} | ||
} | ||
_ => { | ||
return Err(DataFusionError::Internal(format!( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think Internal errors are only intended for bugs in DataFusion -- this error seems like it could come from bad user input too
return Err(DataFusionError::Internal(format!( | |
return Err(DataFusionError::Plan(format!( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I should consider other functions as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed: #6108
use datafusion_common::ScalarValue; | ||
use datafusion_common::{DataFusionError, Result}; | ||
use datafusion_expr::ColumnarValue; | ||
use std::sync::Arc; | ||
|
||
macro_rules! downcast_arg { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is the same as downcast_value
: https://docs.rs/datafusion-common/26.0.0/datafusion_common/macro.downcast_value.html
caused by | ||
Error during planning: Cannot automatically convert List\(Field \{ name: "item", data_type: List\(Field \{ name: "item", data_type: List\(Field \{ name: "item", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: \{\} \}\), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: \{\} \}\), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: \{\} \}\) to List\(Field \{ name: "item", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: \{\} \}\) | ||
Internal error: Optimizer rule 'simplify_expressions' failed, due to generate a different schema, original schema: DFSchema \{ fields: \[DFField \{ qualifier: None, field: Field \{ name: "array_fill\(Int64\(1\),make_array\(\)\)", data_type: List\(Field \{ name: "item", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: \{\} \}\), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: \{\} \} \}\], metadata: \{\} \}, new schema: DFSchema \{ fields: \[DFField \{ qualifier: None, field: Field \{ name: "array_fill\(Int64\(1\),make_array\(\)\)", data_type: List\(Field \{ name: "item", data_type: Null, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: \{\} \}\), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: \{\} \} \}\], metadata: \{\} \}\. This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
something seems wrong with this test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Weird, there seems to be a merge problem.
builder.values().append_null(); | ||
} else { | ||
builder.values().append_value(arg.value(index)); | ||
for index in 0..$ARGS[0].len() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure about this approach of taking either a ListArray
or a NullArray
In the other functions, the way NULL is treated is that the input types are always the same (in this case ListArray) and the values would be null
(aka array.is_valid(i)
would return false for rows that are null.
Complicating matters is if you type a literal null
in sql like:
select array_concat([1,2], null)
That comes to DataFusion as a null
literal (with DataType::Null). The coercion / casting logic normally will coerce this to the appropriate type.
For example, here is how I think arithmetic works with null:
select 1 + NULL
Arrives like
ScalarValue::Int32(Some(1)) + ScalarValue::Null
And then the coercion logic will add a cast to Int32:
ScalarValue::Int32(Some(1)) + CAST(ScalarValue::Null, DataType::Int32)
And then the constant folder will collapse this into:
ScalarValue::Int32(Some(1)) + ScalarValue::Int32(None)
So by the time the arithmetic kernel sees it, it only has to deal with arguments of Int32
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your arguments seem reasonable.
I have two ideas regarding the introduction of this feature:
- Cast all nulls to other data types:
select make_array(1, Null, 2 Null);
----
1, 0, 2, 0
- Ignore nulls:
select make_array(1, Null, 2, Null);
----
1, 2
What do you think of these approaches @alamb?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would expect that the array contains nulls values
Like
select make_array(1, Null, 2 Null);
----
1, NULL, 2, NULL
Where the element of the ListArray are marked as null 🤔
Hello, @alamb!
Or find the information:
As I know we can create column with NULLS values. |
Thanks @izveigor -- I will try and find time to review / comment on this PR in more detail |
Yes, I agree with this sentiment -- I think I am confused. I was imagning that the My comments about |
That would be awesome! I found you have filed #6555 to track it 👍 |
It would be desirable to quickly decide the fate of this PR @alamb. |
I will review it again shortly |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@izveigor here are my thoughts on this PR:
- The test cases demonstrate that this PR has much better behavior than we currently have so 👍
- I think the approach in this PR to casting is not consistent with how expressions are handled in the rest of the DataFusion codebase and that worries me a lot. Specifically, the rest of the codebase will effectively convert
ScalarValue::Null
toScalarValue::TheRightType(None)
during theTypeCoercion
phase of the Analyzer type coercion and then the rest of the code has the correctly typed NULLs. This is why you don't see Null handling inBuiltInScalarFunction::return_type
for other functions. I would expect thatcoerce_arguments_for_signature
would be updated to have the appropriate logic to determine how to cast the arguments tomake_array
and other functions.
Given your past history of follow on PRs to improve the code, I think it would be acceptable to me if we wanted to merge this PR in and work in improving things as a follow on. However, I really think the current approach in this PR is problematic long term.
How would you like to proceed?
query error DataFusion error: SQL error: ParserError\("Expected an SQL statement, found: caused"\) | ||
caused by | ||
Error during planning: Cannot automatically convert List\(Field \{ name: "item", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: \{\} \}\) to List\(Field \{ name: "item", data_type: Null, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: \{\} \}\) | ||
# array scalar function with nulls |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these examples look much better to me
@alamb I think it's worth creating a separate ticket for the problem. Therefore, if there are no serious claims, I think it is worth merging this PR. |
Can you please file this ticket?
Let's do it - I'll do so when the CI passes |
I merged up to resolve conflicts in |
Which issue does this PR close?
Closes #6556
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Yes
Are there any user-facing changes?
Yes