feat: support NULL in array functions #6662

Merged
merged 9 commits into from
Jun 29, 2023

Conversation

izveigor
Contributor

Which issue does this PR close?

Closes #6556

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Yes

Are there any user-facing changes?

Yes

@izveigor izveigor marked this pull request as draft June 13, 2023 18:50
@github-actions github-actions bot added core Core DataFusion crate physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt) labels Jun 13, 2023
@github-actions github-actions bot added logical-expr Logical plan and expressions optimizer Optimizer rules labels Jun 19, 2023
@izveigor izveigor marked this pull request as ready for review June 19, 2023 18:55
@izveigor
Contributor Author

@alamb, could you review this PR when you have time?
I have a problem with the ArrayFill function: to determine its return type, we need to know the length of its second argument (a List). I think we should create a separate ticket to discuss possible solutions.

@alamb
Contributor

alamb commented Jun 20, 2023

Hi @izveigor -- I will review this tomorrow. Sorry I am behind

@alamb alamb changed the title from "feat: supports NULL in arrays" to "feat: support NULL in array functions" Jun 21, 2023
Contributor

@alamb alamb left a comment

Thank you for this @izveigor -- I didn't go through this entire PR carefully, but I am not sure about the approach to null handling. I tried to explain how the rest of DataFusion handles this -- hopefully it is clear

Ok(List(Arc::new(Field::new("item", expr_type, true))))
}
BuiltinScalarFunction::ArrayDims => Ok(UInt8),
BuiltinScalarFunction::ArrayFill => Ok(Null),
Contributor

I don't understand why ArrayFill always returns Null as its data type

My reading of https://www.postgresql.org/docs/9.1/functions-array.html suggests that it should return something like List(args[0].type)

Contributor Author

The function returns a nested list:

postgres=# select array_fill(3, array[2, 3, 2]);
                array_fill                 
-------------------------------------------
 {{{3,3},{3,3},{3,3}},{{3,3},{3,3},{3,3}}}
(1 row)

Therefore it should return a nested list. I think this is a serious problem, because to return a nested list with the right dimensions we need to know the length of the second argument (the list).
P.S. After the changes in #6595 it does not work.
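
To illustrate why the return type depends on the length of the second argument, here is a minimal sketch. The `DataType` enum and the `array_fill_type` helper are hypothetical stand-ins written for this example; they are not the Arrow or DataFusion API. The sketch assumes the number of dimensions is already known, which is exactly the information that is hard to obtain at planning time.

```rust
// Simplified stand-in for Arrow's DataType (an assumption for illustration only).
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int64,
    List(Box<DataType>),
}

/// Wrap `element` in one `List` layer per dimension.
/// `ndims` plays the role of the length of array_fill's second argument.
fn array_fill_type(element: DataType, ndims: usize) -> DataType {
    (0..ndims).fold(element, |inner, _| DataType::List(Box::new(inner)))
}

fn main() {
    // array_fill(3, array[2, 3, 2]) has three dimensions,
    // so the result type needs three nested List layers.
    let ty = array_fill_type(DataType::Int64, 3);
    println!("{:?}", ty);
}
```

With `ndims` unknown until the second argument is evaluated, a function that must report its type from argument types alone cannot choose the nesting depth.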

Contributor Author

to_type: &DataType,
schema: &DFSchema,
) -> Result<Expr> {
if from_type.equals_datatype(&DataType::Null) {
Contributor

I would have thought we should be casting all the arguments that are null to the specific type of the rest of the arguments...

Contributor Author

For reference, this is the first way to solve the problem (see below).

}
}
_ => {
return Err(DataFusionError::Internal(format!(
Contributor

I think Internal errors are only intended for bugs in DataFusion -- this error seems like it could come from bad user input too

Suggested change
- return Err(DataFusionError::Internal(format!(
+ return Err(DataFusionError::Plan(format!(

Contributor Author

I think I should consider other functions as well.

Contributor

Indeed: #6108

use datafusion_common::ScalarValue;
use datafusion_common::{DataFusionError, Result};
use datafusion_expr::ColumnarValue;
use std::sync::Arc;

macro_rules! downcast_arg {
Contributor

caused by
Error during planning: Cannot automatically convert List\(Field \{ name: "item", data_type: List\(Field \{ name: "item", data_type: List\(Field \{ name: "item", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: \{\} \}\), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: \{\} \}\), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: \{\} \}\) to List\(Field \{ name: "item", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: \{\} \}\)
Internal error: Optimizer rule 'simplify_expressions' failed, due to generate a different schema, original schema: DFSchema \{ fields: \[DFField \{ qualifier: None, field: Field \{ name: "array_fill\(Int64\(1\),make_array\(\)\)", data_type: List\(Field \{ name: "item", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: \{\} \}\), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: \{\} \} \}\], metadata: \{\} \}, new schema: DFSchema \{ fields: \[DFField \{ qualifier: None, field: Field \{ name: "array_fill\(Int64\(1\),make_array\(\)\)", data_type: List\(Field \{ name: "item", data_type: Null, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: \{\} \}\), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: \{\} \} \}\], metadata: \{\} \}\. This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker
Contributor

something seems wrong with this test

Contributor Author

Weird, there seems to be a merge problem.

builder.values().append_null();
} else {
builder.values().append_value(arg.value(index));
for index in 0..$ARGS[0].len() {
Contributor

I am not sure about this approach of taking either a ListArray or a NullArray

In the other functions, NULL is handled by keeping the input types the same (in this case ListArray) and marking the values as null (i.e. array.is_valid(i) returns false for rows that are null).

Complicating matters is if you type a literal null in sql like:

select array_concat([1,2], null)

That comes to DataFusion as a null literal (with DataType::Null). The coercion / casting logic normally will coerce this to the appropriate type.

For example, here is how I think arithmetic works with null:

select 1 + NULL

Arrives like

ScalarValue::Int32(Some(1)) + ScalarValue::Null

And then the coercion logic will add a cast to Int32:

ScalarValue::Int32(Some(1)) + CAST(ScalarValue::Null, DataType::Int32)

And then the constant folder will collapse this into:

ScalarValue::Int32(Some(1)) + ScalarValue::Int32(None)

So by the time the arithmetic kernel sees it, it only has to deal with arguments of Int32
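
The three steps above can be sketched with a toy expression tree. Everything here (`ScalarValue`, `Expr`, `coerce`, `fold`) is a heavily simplified, hypothetical model written for this example, not DataFusion's actual types or optimizer passes; in particular, the cast target is hard-coded to Int32.

```rust
// Hypothetical, simplified stand-ins for DataFusion's scalar and expression types.
#[derive(Debug, Clone, PartialEq)]
enum ScalarValue {
    Null,
    Int32(Option<i32>),
}

#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Literal(ScalarValue),
    CastToInt32(Box<Expr>),
    Add(Box<Expr>, Box<Expr>),
}

/// Wrap an untyped NULL literal in a cast; leave anything else alone.
fn cast_if_null(expr: Expr) -> Expr {
    match expr {
        Expr::Literal(ScalarValue::Null) => {
            Expr::CastToInt32(Box::new(Expr::Literal(ScalarValue::Null)))
        }
        other => other,
    }
}

/// "Type coercion": add casts around untyped NULL operands of `+`.
fn coerce(expr: Expr) -> Expr {
    match expr {
        Expr::Add(l, r) => Expr::Add(Box::new(cast_if_null(*l)), Box::new(cast_if_null(*r))),
        other => other,
    }
}

/// "Constant folding": collapse CAST(NULL AS Int32) into a typed null.
fn fold(expr: Expr) -> Expr {
    match expr {
        Expr::CastToInt32(inner) => match fold(*inner) {
            Expr::Literal(ScalarValue::Null) => Expr::Literal(ScalarValue::Int32(None)),
            folded => Expr::CastToInt32(Box::new(folded)),
        },
        Expr::Add(l, r) => Expr::Add(Box::new(fold(*l)), Box::new(fold(*r))),
        other => other,
    }
}

fn main() {
    // select 1 + NULL
    let expr = Expr::Add(
        Box::new(Expr::Literal(ScalarValue::Int32(Some(1)))),
        Box::new(Expr::Literal(ScalarValue::Null)),
    );
    let coerced = coerce(expr);
    println!("after coercion: {:?}", coerced);
    println!("after folding:  {:?}", fold(coerced));
}
```

In this model, any kernel that later evaluates the `Add` node only ever sees `Int32` operands, which is the property described above.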

Contributor Author

Your arguments seem reasonable.
I have two ideas regarding the introduction of this feature:

  1. Cast all nulls to other data types:
select make_array(1, Null, 2, Null);
----
1, 0, 2, 0
  2. Ignore nulls:
select make_array(1, Null, 2, Null);
----
1, 2

What do you think of these approaches @alamb?

Contributor

I would expect the array to contain null values

Like

select make_array(1, Null, 2, Null);
---- 
1, NULL, 2, NULL

Where the elements of the ListArray are marked as null 🤔
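
A minimal sketch of that expectation, using `Option<i64>` as a stand-in for a nullable Int64 slot. The `make_array`, `is_valid`, and `render` functions below are hypothetical helpers written for this example; they only model the idea of a validity flag per element and are not the Arrow ListArray API.

```rust
/// Hypothetical stand-in for make_array over Int64 arguments.
/// `None` stands for SQL NULL. Nulls stay in place: they are neither
/// replaced by 0 nor dropped.
fn make_array(args: &[Option<i64>]) -> Vec<Option<i64>> {
    args.to_vec()
}

/// Models a validity check: false for a null slot (or an out-of-range index).
fn is_valid(list: &[Option<i64>], index: usize) -> bool {
    matches!(list.get(index), Some(Some(_)))
}

/// Render the list the way the expected output above is written.
fn render(list: &[Option<i64>]) -> String {
    list.iter()
        .map(|slot| match slot {
            Some(v) => v.to_string(),
            None => "NULL".to_string(),
        })
        .collect::<Vec<_>>()
        .join(", ")
}

fn main() {
    // select make_array(1, Null, 2, Null);
    let list = make_array(&[Some(1), None, Some(2), None]);
    println!("{}", render(&list));
    println!("slot 1 valid? {}", is_valid(&list, 1));
}
```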

@izveigor
Contributor Author

Hello, @alamb!
I understand your arguments about NULL handling in arrays and initially agreed with them.
But I found a serious problem that only the initial approach can solve.
As you know, arrays in PostgreSQL and other SQL databases are not data types for arithmetic operations.
Rather, they are data types that help you look up information and form new columns.
For example, if we want to create a table with the unnest function (which I hope will soon be implemented in Arrow DataFusion):

postgres=# SELECT * FROM
  unnest(
    ARRAY[1, 2, 3, 4, 5],
    ARRAY['Ringo', 'George', 'Paul', 'John', NULL],
    ARRAY[29, 27, 27, 29, NULL]
  ) AS data(id,name,age);
 id |  name  | age 
----+--------+-----
  1 | Ringo  |  29
  2 | George |  27
  3 | Paul   |  27
  4 | John   |  29
  5 |        |    
(5 rows)

Or find the information:

postgres=# SELECT array_positions(array[1, 2, 3, 4, 5], 1);
 array_positions 
-----------------
 {1}
(1 row)

As far as I know, we can create a column with NULL values.
So I think NULLs must exist in their usual sense (without being cast or skipped).
I hope I have explained this well.
What do you think?

@alamb
Contributor

alamb commented Jun 24, 2023

Thanks @izveigor -- I will try and find time to review / comment on this PR in more detail

@alamb
Contributor

alamb commented Jun 27, 2023

As far as I know, we can create a column with NULL values.
So I think NULLs must exist in their usual sense (without being cast or skipped).

Yes, I agree with this sentiment -- I think I was confused. I was imagining that the ARRAY would be a ListArray with null elements that represented NULL.

I think my comments about NULLs above refer to how coercion works (it should be done early in the query rather than in the physical expr).

@alamb
Contributor

alamb commented Jun 27, 2023

For example, if we want to create a table with the unnest function (which I hope will soon be implemented in Arrow DataFusion)

That would be awesome! I found you have filed #6555 to track it 👍

@izveigor
Contributor Author

It would be desirable to quickly decide the fate of this PR @alamb.

@alamb
Contributor

alamb commented Jun 28, 2023

I will review it again shortly

Contributor

@alamb alamb left a comment

@izveigor here are my thoughts on this PR:

  1. The test cases demonstrate that this PR has much better behavior than we currently have so 👍
  2. I think the approach to casting in this PR is not consistent with how expressions are handled in the rest of the DataFusion codebase, and that worries me a lot. Specifically, the rest of the codebase effectively converts ScalarValue::Null to ScalarValue::TheRightType(None) during the TypeCoercion phase of the Analyzer, so the rest of the code sees correctly typed NULLs. This is why you don't see Null handling in BuiltinScalarFunction::return_type for other functions. I would expect coerce_arguments_for_signature to be updated with the appropriate logic to determine how to cast the arguments to make_array and other functions.
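
As a rough illustration of point 2, the sketch below resolves Null-typed arguments to the type of the other arguments before any function body runs. The `DataType` enum and the `coerce_null_arguments` helper are hypothetical and written only for this example; they do not reproduce the signature or behavior of DataFusion's coerce_arguments_for_signature, and the "first non-Null type wins" rule is an assumption.

```rust
// Simplified stand-in for Arrow's DataType (an assumption for illustration only).
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Null,
    Int64,
    Utf8,
}

/// Hypothetical helper: return the type each argument should be cast to.
/// Null-typed arguments take the first non-Null argument type; if every
/// argument is Null, the types stay Null. Mixed non-Null types are rejected.
fn coerce_null_arguments(arg_types: &[DataType]) -> Result<Vec<DataType>, String> {
    let target = arg_types
        .iter()
        .find(|t| **t != DataType::Null)
        .cloned()
        .unwrap_or(DataType::Null);

    arg_types
        .iter()
        .map(|t| {
            if *t == DataType::Null || *t == target {
                Ok(target.clone())
            } else {
                Err(format!("cannot coerce {:?} to {:?}", t, target))
            }
        })
        .collect()
}

fn main() {
    // make_array(1, NULL, 2, NULL): argument types as seen before coercion.
    let before = [DataType::Int64, DataType::Null, DataType::Int64, DataType::Null];
    println!("{:?}", coerce_null_arguments(&before));
}
```

Doing this during planning means the function implementation only has to handle one element type, with nulls represented as null slots of that type.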

Given your history of follow-on PRs to improve the code, it would be acceptable to me to merge this PR and work on improving things as a follow-on. However, I really think the current approach in this PR is problematic in the long term.

How would you like to proceed?

query error DataFusion error: SQL error: ParserError\("Expected an SQL statement, found: caused"\)
caused by
Error during planning: Cannot automatically convert List\(Field \{ name: "item", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: \{\} \}\) to List\(Field \{ name: "item", data_type: Null, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: \{\} \}\)
# array scalar function with nulls
Contributor

these examples look much better to me

@izveigor
Contributor Author

@alamb I think it's worth creating a separate ticket for the problem. Therefore, if there are no serious objections, I think it is worth merging this PR.

@alamb
Contributor

alamb commented Jun 28, 2023

@alamb I think it's worth creating a separate ticket for the problem.

Can you please file this ticket?

Therefore, if there are no serious objections, I think it is worth merging this PR.

Let's do it - I'll do so when the CI passes

@alamb
Contributor

alamb commented Jun 28, 2023

I merged up to resolve conflicts in array.slt

@alamb
Contributor

alamb commented Jul 7, 2023

I believe this PR introduced a regression: #6887 cc @izveigor

Labels
core Core DataFusion crate logical-expr Logical plan and expressions optimizer Optimizer rules physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Supports NULL in arrays
2 participants