feat: support LargeList in make_array and array_length #8121

Merged (13 commits) on Dec 3, 2023

Member

@Weijun-H Weijun-H commented Nov 10, 2023

Which issue does this PR close?

Part of #8084
Part of #8185

Rationale for this change

The current implementations of make_array and array_length lack support for LargeList. This PR makes them generic over OffsetSizeTrait so that they work with both List and LargeList. Because LargeList uses 64-bit offsets, the functions can then handle lists with more than i32::MAX elements.

What changes are included in this PR?

  • make_array can return List or LargeList depending on the array length
  • support LargeList in array_length
  • array_array and array! take OffsetSizeTrait into account

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added the physical-expr Physical Expressions label Nov 10, 2023
Comment on lines 3264 to 3157
fn return_large_array() -> ArrayRef {
    // Returns: [1, 2, 3, 4]
    let capacity = i32::MAX as usize + 10;
    let args = vec![Arc::new(Int64Array::from(vec![Some(1)])) as ArrayRef; capacity];

    println!("args.len() = {}", args.len());

    make_array(&args).expect("failed to initialize function array")
}

Member Author

I don't know how to efficiently create a LargeList using make_array as it runs out of memory on my end.

Contributor

Maybe you could use the constructor below to build a LargeList:

let list_array = Arc::new(LargeListArray::from_iter_primitive::<Int32Type, _, _>(vec![
    Some(vec![Some(0), Some(1), Some(2)]),
    None,
    Some(vec![Some(3), None, Some(5)]),
])) as ArrayRef;

@Weijun-H Weijun-H force-pushed the feat-support-largelist branch 5 times, most recently from 91612a4 to b5f9e68 Compare November 15, 2023 09:14
Comment on lines 3088 to 3157
fn return_large_array() -> ArrayRef {
    // Returns: [1, 2, 3, 4]
    let args = [
        Arc::new(Int64Array::from(vec![Some(1)])) as ArrayRef,
        Arc::new(Int64Array::from(vec![Some(2)])) as ArrayRef,
        Arc::new(Int64Array::from(vec![Some(3)])) as ArrayRef,
        Arc::new(Int64Array::from(vec![Some(4)])) as ArrayRef,
    ];
    let data_type = DataType::Int64;
    array_array::<i64>(&args, data_type).expect("failed to initialize function array")
}

Member Author

This is a hack to create a LargeList without using too much memory in the test.

Contributor

What is the hack?

@Weijun-H Weijun-H marked this pull request as ready for review November 15, 2023 10:59
@Weijun-H Weijun-H force-pushed the feat-support-largelist branch 3 times, most recently from f7bdb8f to a06fbf8 Compare November 16, 2023 10:17
@Weijun-H Weijun-H mentioned this pull request Nov 17, 2023
19 tasks
@Weijun-H
Member Author

@alamb @jayzhan211 @Veeupup
PTAL :-)

Contributor

@alamb alamb left a comment

Thank you for the contribution @Weijun-H -- the basic structure of most of this PR looks good to me -- thank you. I left some comments, but I think it is pretty close.

Perhaps @jayzhan211 has some additional thoughts to share

@@ -1716,7 +1716,11 @@ impl ScalarValue {
} else {
Self::iter_to_array(values.iter().cloned()).unwrap()
};
- Arc::new(array_into_list_array(values))
+ if values.len() <= i32::MAX as usize {
Contributor

I think the ScalarValue type needs to match the underlying array type -- so with this change, a ScalarValue::List might return LargeListArray or ListArray. This mismatch will likely cause issues downstream.

I think the standard pattern would be to add a new function ScalarValue::new_large_list and a new variant ScalarValue::LargeList, if they don't already exist.
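
To make the concern above concrete, here is a minimal sketch of the suggested pattern, in which the scalar variant alone decides the array type and the number of values never does. `Scalar` and `ArrayKind` are hypothetical, heavily simplified types invented for this example (they track offsets only); they are not DataFusion's `ScalarValue` or arrow's array types.

```rust
use std::convert::TryFrom;

/// Hypothetical, simplified array representation: just the offsets buffer.
#[derive(Debug, PartialEq)]
enum ArrayKind {
    List { offsets: Vec<i32> },
    LargeList { offsets: Vec<i64> },
}

/// Hypothetical scalar with one variant per list flavour.
#[derive(Debug, PartialEq)]
enum Scalar {
    List(Vec<i64>),
    LargeList(Vec<i64>),
}

impl Scalar {
    fn new_list(values: Vec<i64>) -> Self {
        Scalar::List(values)
    }

    fn new_large_list(values: Vec<i64>) -> Self {
        Scalar::LargeList(values)
    }

    /// The variant determines the array type; there is no length-based switch.
    fn to_array(&self) -> ArrayKind {
        match self {
            Scalar::List(v) => ArrayKind::List {
                offsets: vec![
                    0,
                    i32::try_from(v.len()).expect("List holds at most i32::MAX values"),
                ],
            },
            Scalar::LargeList(v) => ArrayKind::LargeList {
                offsets: vec![
                    0,
                    i64::try_from(v.len()).expect("length fits in i64"),
                ],
            },
        }
    }
}
```

Downstream code that matches on the scalar variant can then rely on the array type it gets back, which is the invariant the review comment asks for.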

@@ -2939,6 +3011,43 @@ mod tests {
as_uint64_array(&arr).expect("failed to initialize function array_length");

assert_eq!(result, &UInt64Array::from(vec![None]));

Contributor

Rather than extending this module with more Rust tests, can you perhaps add SQL-level tests using sqllogictest?

That will ensure the functions are usable end to end and that all the connections are in place

Perhaps https://github.com/apache/arrow-datafusion/blob/main/datafusion/sqllogictest/test_files/array.slt

Contributor

I'm not sure how to create a large list in a SQL file. I think that's why he wrote the test here? @Weijun-H

Member Author

Yes, I do not know how to construct a LargeList in sqllogictest other than calling make_array with more than i32::MAX elements.

Member Author

The current idea is to use arrow_cast() to create a LargeList instead of make_array. This PR is blocked by #8290 @alamb @jayzhan211

offsets.push(0);

let mut offsets: Vec<O> = Vec::with_capacity(total_len);
offsets.push(O::from_usize(0).unwrap());
Contributor

Consider usize_as here.

Arc::new(Field::new("item", data_type, true)),
- OffsetBuffer::new(offsets.into()),
+ OffsetBuffer::new(ScalarBuffer::from(offsets)),
Contributor

No need for this with usize_as.

match &args[0].data_type() {
DataType::List(_) => _array_length_list::<i32>(args),
DataType::LargeList(_) => _array_length_list::<i64>(args),
_ => Err(DataFusionError::Internal(format!(
Contributor

macro

Contributor

Unresolved

}

/// array_length for List and LargeList
fn _array_length_list<O: OffsetSizeTrait>(args: &[ArrayRef]) -> Result<ArrayRef> {
Contributor

Could this use a name without the leading _? In Rust, _xxx usually means unused.
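
The dispatch shape under review (match on the data type, then call one generic helper with i32 or i64 offsets) can be sketched with the standard library alone. Everything here is a simplified assumption for illustration: `Column`, `list_lengths`, and this `array_length` are hypothetical, the helper name simply avoids the leading underscore, and the real function also takes a dimension argument that this sketch ignores.

```rust
/// Hypothetical column type: list variants carry only their offsets buffer.
enum Column {
    List(Vec<i32>),      // 32-bit offsets
    LargeList(Vec<i64>), // 64-bit offsets
    Int64(Vec<i64>),     // not a list
}

/// Per-row lengths computed from consecutive offsets, generic over offset width.
fn list_lengths<O: Copy + Into<i64>>(offsets: &[O]) -> Vec<u64> {
    offsets
        .windows(2)
        .map(|w| {
            let start: i64 = w[0].into();
            let end: i64 = w[1].into();
            (end - start) as u64
        })
        .collect()
}

/// Pick the offset width from the column type, then reuse one generic helper.
fn array_length(column: &Column) -> Result<Vec<u64>, String> {
    match column {
        Column::List(offsets) => Ok(list_lengths::<i32>(offsets)),
        Column::LargeList(offsets) => Ok(list_lengths::<i64>(offsets)),
        Column::Int64(_) => Err("array_length does not support this type".to_string()),
    }
}
```

Keeping the type-specific code down to the two match arms means any further list flavour only needs one more arm, not a second copy of the length logic.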

# array_length scalar function #4
query II
select array_length(array_repeat(array_repeat(array_repeat(3, 5), 2), 3), 1), array_length(array_repeat(array_repeat(array_repeat(3, 5), 2), 3), 2);
----
3 2

query II
select array_length(arrow_cast(array_repeat(array_repeat(array_repeat(3, 5), 2), 3), 'LargeList(List(List(Int64)))'), 1), array_length(arrow_cast(array_repeat(array_repeat(array_repeat(3, 5), 2), 3), 'LargeList(List(List(Int64)))'), 2);
Member Author

@Weijun-H Weijun-H Nov 22, 2023

Cannot test LargeList(LargeList) because arrow-rs does not support it yet (see #8305).

Contributor

Would it be possible to add a test that tries to cast LargeList(List) as a follow-on PR? It would, of course, error initially, but once we upgrade arrow to a version that supports that cast, we would have the test coverage.

@@ -355,11 +364,11 @@ fn array_array(args: &[ArrayRef], data_type: DataType) -> Result<ArrayRef> {
mutable.extend_nulls(1);
}
}
- offsets.push(mutable.len() as i32);
+ offsets.push(O::from_usize(mutable.len()).unwrap());
Contributor

I think we don't need the unwrap if we cast via usize_as.
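
The trade-off behind this suggestion can be shown with a small stand-alone sketch. The `Offset` trait below is hypothetical and only mirrors the two conversion styles being discussed. It assumes that from_usize is a checked conversion returning an Option and that usize_as behaves like the `as` operator; please verify both against the arrow-rs documentation. The wrap-around example also assumes a 64-bit target.

```rust
use std::convert::TryFrom;

/// Hypothetical trait mirroring the two conversion styles (not the arrow-rs trait).
trait Offset: Copy {
    /// Checked: `None` when `n` does not fit in the offset type.
    fn from_usize(n: usize) -> Option<Self>;
    /// Unchecked: converts like the `as` operator, so it can wrap silently.
    fn usize_as(n: usize) -> Self;
}

impl Offset for i32 {
    fn from_usize(n: usize) -> Option<Self> {
        i32::try_from(n).ok()
    }
    fn usize_as(n: usize) -> Self {
        n as i32
    }
}

impl Offset for i64 {
    fn from_usize(n: usize) -> Option<Self> {
        i64::try_from(n).ok()
    }
    fn usize_as(n: usize) -> Self {
        n as i64
    }
}

/// Checked style: needs an unwrap and panics if `len` does not fit.
fn push_checked<O: Offset>(offsets: &mut Vec<O>, len: usize) {
    offsets.push(<O as Offset>::from_usize(len).unwrap());
}

/// Cast style: no unwrap, but an oversized `len` wraps instead of failing.
fn push_cast<O: Offset>(offsets: &mut Vec<O>, len: usize) {
    offsets.push(<O as Offset>::usize_as(len));
}
```

For in-range lengths both styles push the same value. The cast style is shorter, but under these assumptions it gives up overflow detection, so it is best suited to places where the length is already known to fit the offset type.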

Contributor

@jayzhan211 jayzhan211 left a comment

Overall LGTM

@Weijun-H
Member Author

Weijun-H commented Nov 23, 2023

Overall LGTM

Thank you for reviewing 👍 @jayzhan211

Contributor

@jayzhan211 jayzhan211 left a comment

LGTM!

@Weijun-H Weijun-H force-pushed the feat-support-largelist branch 2 times, most recently from 0e58ca8 to 5d678a0 Compare December 1, 2023 08:50
Contributor

@alamb alamb left a comment

Thank you for this PR @Weijun-H -- I think it is very close

match data_type {
// Either an empty array or all nulls:
DataType::Null => {
let array = new_null_array(&DataType::Null, arrays.len());
- Ok(Arc::new(array_into_list_array(array)))
+ if len <= i32::MAX as usize {
Contributor

I still don't think that make_array should ever return NullArray (aka DataType::Null) and instead it should always return ListArray (possibly with a null value)

However, this PR doesn't make this situation worse, so 👍

datafusion/physical-expr/src/array_expressions.rs (outdated, resolved)
Contributor

@alamb alamb left a comment

Looks good to me -- thank you @Weijun-H

@alamb alamb merged commit f6af014 into apache:main Dec 3, 2023
22 checks passed
@alamb
Contributor

alamb commented Dec 3, 2023

Thanks for sticking with these PRs -- I know it has taken time, but now that we have the patterns down I feel like the code is really improving at a nice rate. This is what I think successful software development and incremental improvement look like!

appletreeisyellow pushed a commit to appletreeisyellow/datafusion that referenced this pull request Dec 14, 2023
…8121)

* feat: support LargeList in make_array and array_length

* chore: add tests

* fix: update tests for nested array

* use usise_as

* add new_large_list

* refactor array_length

* add comment

* update test in sqllogictest

* fix ci

* fix macro

* use usize_as

* update comment

* return based on data_type in make_array
Labels
physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt)

4 participants