Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support Nested Arrays in Comparison Kernels #5426

Closed
tustvold opened this issue Feb 24, 2024 · 16 comments · Fixed by #5792
Closed

Support Nested Arrays in Comparison Kernels #5426

tustvold opened this issue Feb 24, 2024 · 16 comments · Fixed by #5792
Assignees
Labels
enhancement Any new improvement worthy of a entry in the changelog

Comments

@tustvold
Copy link
Contributor

tustvold commented Feb 24, 2024

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Currently build_compare can be used to construct a DynComparator that can be used to establish an ordering between two different array indices. It is up to the caller to inspect the null masks, and apply ordering as appropriate. This works well for non-nested arrays, but runs into issues when arrays have children.

The introduction of logical nullability worked around this limitation for DictionaryArray and RunArray, but it is unclear how best to handle types such as StructArray and ListArray.

Part of the challenge is there isn't one way users may wish to handle nulls:

  • When sorting nulls are ordered based on the SortOptions::nulls_first parameter
  • When used in a comparison operator, the behaviour of nulls depends on the operator

Also correctly using DynComparator requires understanding the distinction between logical and physical nullability, see #4691, which is confusing (#4840)

Describe the solution you'd like

In order to support arbitrarily nested ListArray in comparison kernels, we need a mechanism similar to DynComparator but which is able to handle nulls.

I would like something similar to the below

pub enum Compare {
    LeftNull,
    RightNull,
    BothNull,
    Less,
    Equal,
    Greater,
}

impl Compare {
    pub fn ordering(&self, nulls_first: bool) -> Ordering {
        ...
    }

    #[inline]
    pub fn is_null(&self) -> bool {
        matches!(self, Self::LeftNull | Self::RightNull | Self::BothNull)
    }
}

pub type Comparator = Box<dyn Fn(usize, usize) -> Compare + Send + Sync>;

pub fn build_comparator(left: &dyn Array, right: &dyn Array) -> Result<Comparator> {
    ...
}

We could then deprecate and eventually remove build_compare and DynComparator.

This would not only allow supporting nested types in the comparison kernels, but would also provide a coherent story for handling these types.

Describe alternatives you've considered

Additional context

@tustvold tustvold added the enhancement Any new improvement worthy of a entry in the changelog label Feb 24, 2024
@jayzhan211
Copy link
Contributor

I think I can apply Comparator to List first and Struct second, then replace other after Comparator is certain

@jayzhan211
Copy link
Contributor

I think I can apply Comparator to List first and Struct second, then replace other after Comparator is certain

It seems null handling for non-list should be done first than list and struct

@jayzhan211
Copy link
Contributor

How do we pass the null_first parameter? Should we extend compare_op or introduce a new one?

@tustvold
Copy link
Contributor Author

How do we pass the null_first parameter? Should we extend compare_op or introduce a new one?

The idea is you wouldn't, this would instead be a property of how the users choose to interpret the returned Compare

@jayzhan211
Copy link
Contributor

jayzhan211 commented Feb 25, 2024

For List, we iterate the pair of elements and get the Compare for each of them.
How do we determine the final single Compare based on them without null_first?

In this case, we may need to collect the list of Compare until we know what null_first is, which is when ordering(null_first) is called after collecting all the sub-Compare for the nested list. For a higher dimension list, we need to collect more.

It seems to be more memory efficient to have the null_first info in the first place, so we can evaluate the actual ordering early, instead of collecting them and evaluate at the end. And, we can avoid to introduce something like pub type Comparator = Box<dyn Fn(usize, usize) -> Vec<Compare> + Send + Sync>; to collect those Compares.

@tustvold
Copy link
Contributor Author

tustvold commented Feb 25, 2024

How do we determine the final single Compare based on them without null_first?

You return the appropriate variant for the case, e.g. Compare::LeftNull, Compare::RightNull or Compare::BothNull

It seems to be more memory efficient to have the null_first info in the first place, so we can evaluate the actual ordering early

The problem is that for comparison operations we don't establish an ordering for NULLS, but instead have special logic to handle nulls based on the operator being used.

Unfortunately I realise this formulation won't work for DISTINCT and so we may need something else 🤔

Perhaps we just need to implement a dyn dispatch version of compare_op, not sure.

I'll continue to have a play time permitting, I am currently on holiday and have limited time to spend on this.

@tustvold tustvold added development-process Related to development process of arrow-rs and removed help wanted labels Feb 25, 2024
@tustvold tustvold reopened this Feb 25, 2024
@tustvold tustvold self-assigned this Feb 25, 2024
@tustvold tustvold removed the development-process Related to development process of arrow-rs label Feb 25, 2024
@jayzhan211
Copy link
Contributor

jayzhan211 commented Apr 8, 2024

I think I would need to support build_compare nulls for non-nested types like primitive first.

I was testing

let l1 = ListArray::from_iter_primitive::<Int32Type, _, _>(vec![
            Some(vec![Some(1), None, Some(2)]),
            Some(vec![Some(3), Some(4), Some(5)]),
        ]);
        let l2 = ListArray::from_iter_primitive::<Int32Type, _, _>(vec![
            Some(vec![Some(1), Some(1), Some(2)]),
            Some(vec![Some(3), None]),
        ]);

And find out compare_primitive does not handle nulls well. None is equal to 0 for i32, but this is not an expected result. I think your design about both_null, and left_null is helpful for null comparison.

@jayzhan211
Copy link
Contributor

As stated on the ticket, this needs more thought as it won't work for distinct, it is on my list to work out a way to support this but I haven't had time recently

I forgot what is the issue for distinct case, can you explain more?

My understanding is that a is not distinct from b if every element is exactly the same or say every element is either Eq or BothNull.

@tustvold
Copy link
Contributor Author

Yeah, I can't remember why I decided this didn't work 😅

I guess the proof will be to wire up a draft implementation showing it working, and then we'll know for sure 😆

@tustvold
Copy link
Contributor Author

tustvold commented May 20, 2024

I've remembered why the Compare formulation doesn't work.

Take the example of comparing two ListArray elements of [null, 1] and [null, 1]. In the distinct case the result should be false, but in the equality case the result should be null. Now you could just return Compare::BothNull in this case, but what should you return for [null, 1] vs [null, 2], returning Compare::BothNull would be incorrect, but as would Compare::Less

@tustvold tustvold changed the title Dyn Comparison of Nested Arrays Support Nested Arrays in Comparison Kernels May 20, 2024
@jayzhan211
Copy link
Contributor

in the equality case the result should be null
Can we return true for the equality case for [null, 1] and [null, 1]? Why should not we consider null equal to null?

For [null, 1] and [null, 2], I think it is reasonable to return Compare::Less for < op, true for distinct, and false for eq.
Why should we return BothNull for [null, 1] vs [null, 2].

I think the calculation is like

  • [null,1] < [null, 2], compare elementwise, got BothNull, and Less. Iterate left to right, skip BothNull or Eq, so we return Less here.

@tustvold
Copy link
Contributor Author

so we return Less here

Which is incorrect, as the result for the non-distinct case should be null 😅

Ultimately I can't see a way to avoid introducing a type-erased compare_op_dyn, SQL null semantics are annoying. We should be able to keep this an implementation detail though, I intend to have a play tomorrow

@jayzhan211
Copy link
Contributor

jayzhan211 commented May 21, 2024

Which is incorrect, as the result for the non-distinct case should be null

Can you explain more on this, I may miss something

I check the result in postgres and find the result is what I expected

postgres=# select array[null, 1] < array[null, 2];
 ?column? 
----------
 t
(1 row)

I think whether null is first or last if we got both nulls, it could be considered the same, so we can skip to the next element.

Nested array
Postgres

postgres=# select array[array[null, 2], array[null, 1]] < array[array[null, 1], array[null, 2]];
 ?column? 
----------
 f
(1 row)

postgres=# select array[null, array[null, 1]] < array[null, array[null, 2]];
ERROR:  multidimensional arrays must have array expressions with matching dimensions

Duckdb

D select array[null, array[null, 1]] < array[null, array[null, 2]];
┌─────────────────────────────────────────────────────────────────────┐
│ ((ARRAY[NULL, (ARRAY[NULL, 1])]) < (ARRAY[NULL, (ARRAY[NULL, 2])])) │
│                               boolean                               │
├─────────────────────────────────────────────────────────────────────┤
│ true                                                                │
└─────────────────────────────────────────────────────────────────────┘

@tustvold
Copy link
Contributor Author

tustvold commented May 21, 2024

Typically ternary logic would state that comparing a null value with anything else yields a NULL, although it is interesting that postgres does not appear to be following this in your example... It would certainly make things simpler, even if it is wildly inconsistent 😅

Edit: It looks like postgres purposefully deviates from the SQL standard here - https://www.postgresql.org/message-id/CAB4ELO7afJgQfZoQfqfMBA7Zk1AdWRkZ9mUN5jpTZupurQTRsA%40mail.gmail.com

See - https://www.postgresql.org/docs/current/functions-comparisons.html#COMPOSITE-TYPE-COMPARISON

I guess we need to decide if we want to follow that

@jayzhan211
Copy link
Contributor

jayzhan211 commented May 21, 2024

I think we should follow Postgres and Duckdb conventions for now and not worry about SQL standard. We can always extend support for SQL standard in the future if anyone needs it.

@tustvold
Copy link
Contributor Author

tustvold commented May 21, 2024

So I did some experiments:

Presto/Trino simply doesn't support comparison with nested nulls, instead returning a not supported error.

BigQuery doesn't support comparison of nested types at all.

Spark appears to follow the postgres behaviour.

In which case I think I agree that we can just follow postgres and define a total order, I will polish up my implementation of Compare.

Edit: In fact we can possibly use the existing comparator, but pass in the null ordering 🤔

Edit edit: unfortunately spark and postgres use differing null orderings by default, which means this needs to be configurable

tustvold added a commit to tustvold/arrow-rs that referenced this issue May 21, 2024
tustvold added a commit to tustvold/arrow-rs that referenced this issue May 21, 2024
tustvold added a commit to tustvold/arrow-rs that referenced this issue May 21, 2024
tustvold added a commit to tustvold/arrow-rs that referenced this issue May 21, 2024
tustvold added a commit that referenced this issue May 28, 2024
… (#5792)

* Push SortOptions into DynComparator (#5426)

* Clippy

* Review feedback

* Tweak print_row
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Any new improvement worthy of a entry in the changelog
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants