Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Minor: Support nulls in array_replace, avoid a copy #8054

Merged
merged 8 commits into from
Nov 9, 2023

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Nov 4, 2023

Which issue does this PR close?

Follow on to #8050 from @jayzhan211 --

Rationale for this change

The change in #8050 is a huge improvement, and while reviewing it I saw a few more things to improve, specifically have it handle nulls handing and avoiding some copies.

I also wanted to try and help illustrate some patterns for other array function rewrites that I think @jayzhan211 is planning

What changes are included in this PR?

  1. Add comments to general_array and change names of parameters and variables so it is easier (in my opinion) to follow the logic
  2. Handles NULL values

Are these changes tested?

  1. Covered by existing test
  2. Added a new test for null values

Are there any user-facing changes?

array_replace now properly handles NULL values

@github-actions github-actions bot added the physical-expr Physical Expressions label Nov 4, 2023
let from_array = &args[1];
let to_array = &args[2];

/// For each element of `list_array[i]`, replaces up to `arr_n[i]` occurences
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, having a clear description of what the function does, especially when it is mind bending likegeneral_replace helps a lot. I also find that writing such a description often helps improve the code as well as results in better naming

For example, I struggled to explain how this function worked when it took &[ArrayRef] because then the explanations were in terms fof arg[0], arg[1], and arg[2]. Giving the parameters names made things much clearer


for (row_index, (arr, n)) in list_array.iter().zip(arr_n.iter()).enumerate() {
// n is the number of elements to replace in this row
for (row_index, (list_array_row, n)) in
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I rename these variables to better explain what they are (rather than arr a shorthand for array which is already overloaded in this function

bool_values.push(Some(a.eq(&from_a)));
} else {
return internal_err!(
"Null value is not supported in array_replace"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know why nulls aren't supported -- all of the kernels support them well, and the code is more concise without the special case

}
};

// Use MutableArrayData to build the replaced array
let original_data = list_array_row.to_data();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than using a new array with two indexes, I found naming them helped a lot

new_empty_array(&data_type)
} else {
let new_values: Vec<_> = new_values.iter().map(|a| a.as_ref()).collect();
arrow::compute::concat(&new_values)?
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The old logic called concat for each row, which I think is O(N^2) as it keeps copying the same values over and over again. I changed it to call concat just once

I think it is probably possible to avoid calling concat entirely (use MutableArrayData to build up values directly). We can do that as a follow on PR perhaps.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We used to replace MutableArrayData with ListArray, as it is more straightforward. However, perhaps we can also construct ListArray using MutableArrayData and still maintain readability. Particularly in this case, doing so can enhance performance. We should certainly consider building with MutableArrayData!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, what I was getting at is that the contents of values are each already created using MutableArrayData -- and so I think rather than making many small arrays and then concating them together, the could could just use the same MutableArrayData to build the results directly

@@ -2763,6 +2790,52 @@ mod tests {
);
}

#[test]
fn test_array_replace_with_null() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the null cases were not covered, I added new coverage. Maybe this style of test could help as we clean up the other aray functions

Copy link
Contributor

@jayzhan211 jayzhan211 Nov 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer having test in slt file if possible, we can cleanup test in the follow on PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

slt is a good idea -- I will move the tests

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in c6dabcf

@alamb alamb marked this pull request as ready for review November 5, 2023 10:42
@alamb alamb changed the title Minor: clean up array_replace Support nulls in array_replace, avoid a copy Nov 5, 2023
#[test]
fn test_array_replace_with_null() {
// ([3, 1, NULL, 3], 3, 4, 2) => [4, 1, NULL, 4] NULL not matched
// ([3, 1, NULL, 3], NULL, 5, 2) => [3, 1, NULL, 3] NULL not replaced (not eq)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why null is not replaced with 5 => [3, 1, 5, 3]?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that arrow_ord::cmp::eq([3,1,null,3], null) return [null,null,null,null] which I expect to see [false,false,true,false]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe it's reasonable for arrow-rs to handle null comparisons

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is because in SQL, NULL = NULL is NOT true like you might expect, but is actually NULL

If we want the NULL = NULL = true behavior, we could use the not_distinct kernel instead: https://docs.rs/arrow/latest/arrow/compute/kernels/cmp/fn.not_distinct.html

Copy link
Contributor

@jayzhan211 jayzhan211 Nov 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is because in SQL, NULL = NULL is NOT true like you might expect, but is actually NULL

If we want the NULL = NULL = true behavior, we could use the not_distinct kernel instead: https://docs.rs/arrow/latest/arrow/compute/kernels/cmp/fn.not_distinct.html

I would prefer having not_distinct here, it can let us replace null value with another, if we use eq here, replace with null value is meaningless, also there is no other DB have the similar function #7072, so I think we can have our own definition of what array_replace do.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, sounds good. I will update this PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in acbffa0

@@ -2763,6 +2790,52 @@ mod tests {
);
}

#[test]
fn test_array_replace_with_null() {
Copy link
Contributor

@jayzhan211 jayzhan211 Nov 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer having test in slt file if possible, we can cleanup test in the follow on PR.

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Nov 6, 2023
@alamb alamb changed the title Support nulls in array_replace, avoid a copy Minor: Support nulls in array_replace, avoid a copy Nov 6, 2023
@alamb alamb requested a review from jayzhan211 November 6, 2023 19:14
Copy link
Contributor

@jayzhan211 jayzhan211 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@alamb
Copy link
Contributor Author

alamb commented Nov 8, 2023

@jackwener or @liukun4515 can I trouble you for an approval of this PR so I can merge it? @jayzhan211 has already reviewed it

Copy link
Member

@jackwener jackwener left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice job! Thanks @alamb

datafusion/physical-expr/src/array_expressions.rs Outdated Show resolved Hide resolved
datafusion/physical-expr/src/array_expressions.rs Outdated Show resolved Hide resolved
@alamb
Copy link
Contributor Author

alamb commented Nov 9, 2023

Nice job! Thanks @alamb

Thank you for the review @jackwener and @jayzhan211 -- very much appreciated.

@alamb alamb merged commit 93af440 into apache:main Nov 9, 2023
22 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants