Support Nested Arrays in Comparison Kernels #5426
Comments
I think I can apply Comparator to List first and Struct second, then replace the others once Comparator is settled |
It seems null handling for non-list types should be done before list and struct |
How do we pass the |
The idea is you wouldn't, this would instead be a property of how the users choose to interpret the returned |
For List, we iterate the pair of elements and get the In this case, we may need to collect the list of It seems to be more memory efficient to have the |
You return the appropriate variant for the case, e.g.
The problem is that for comparison operations we don't establish an ordering for NULLS, but instead have special logic to handle nulls based on the operator being used. Unfortunately I realise this formulation won't work for Perhaps we just need to implement a dyn dispatch version of compare_op, not sure. I'll continue to have a play time permitting, I am currently on holiday and have limited time to spend on this. |
I think I would need to support I was testing
let l1 = ListArray::from_iter_primitive::<Int32Type, _, _>(vec![
    Some(vec![Some(1), None, Some(2)]),
    Some(vec![Some(3), Some(4), Some(5)]),
]);
let l2 = ListArray::from_iter_primitive::<Int32Type, _, _>(vec![
    Some(vec![Some(1), Some(1), Some(2)]),
    Some(vec![Some(3), None]),
]);
And find out |
I forgot what is the issue for My understanding is that |
Yeah, I can't remember why I decided this didn't work 😅 I guess the proof will be to wire up a draft implementation showing it working, and then we'll know for sure 😆 |
I've remembered why the Take the example of comparing two ListArray elements of |
For I think the calculation is like
|
Which is incorrect, as the result for the non-distinct case should be null 😅 Ultimately I can't see a way to avoid introducing a type-erased compare_op_dyn, SQL null semantics are annoying. We should be able to keep this an implementation detail though, I intend to have a play tomorrow |
Can you explain more on this? I may be missing something. I checked the result in postgres and found the result is what I expected
I think that, whether nulls are ordered first or last, if both elements are null they can be considered the same, so we can skip to the next element.
Nested array
Duckdb
|
Typically ternary logic would state that comparing a null value with anything else yields a NULL, although it is interesting that postgres does not appear to be following this in your example... It would certainly make things simpler, even if it is wildly inconsistent 😅
Edit: It looks like postgres purposefully deviates from the SQL standard here - https://www.postgresql.org/message-id/CAB4ELO7afJgQfZoQfqfMBA7Zk1AdWRkZ9mUN5jpTZupurQTRsA%40mail.gmail.com
See - https://www.postgresql.org/docs/current/functions-comparisons.html#COMPOSITE-TYPE-COMPARISON
I guess we need to decide if we want to follow that |
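To make the ternary reading concrete, here is a minimal std-only Rust sketch in which `None` stands in for NULL. The function names are hypothetical and none of this is arrow-rs API; treating lists of different lengths as unequal is an assumption of the sketch, not something stated in the thread.

```rust
/// Three-valued equality for one pair of nullable values.
/// `None` means NULL (unknown): comparing NULL with anything yields NULL.
fn ternary_eq(a: Option<i32>, b: Option<i32>) -> Option<bool> {
    match (a, b) {
        (Some(x), Some(y)) => Some(x == y),
        _ => None,
    }
}

/// Three-valued equality for two lists of nullable values.
/// A definite mismatch wins; otherwise any NULL pair makes the result NULL.
/// Assumption of this sketch: lists of different lengths are simply unequal.
fn ternary_list_eq(a: &[Option<i32>], b: &[Option<i32>]) -> Option<bool> {
    if a.len() != b.len() {
        return Some(false);
    }
    let mut saw_null = false;
    for (x, y) in a.iter().zip(b.iter()) {
        match ternary_eq(*x, *y) {
            Some(false) => return Some(false), // definitely different
            None => saw_null = true,           // unknown, keep scanning
            Some(true) => {}
        }
    }
    if saw_null {
        None
    } else {
        Some(true)
    }
}
```

Under this reading `[1, NULL, 2]` compared with `[1, 1, 2]` is NULL rather than true or false, which is the behaviour that a plain total order cannot express.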
I think we should follow Postgres and Duckdb conventions for now and not worry about SQL standard. We can always extend support for SQL standard in the future if anyone needs it. |
So I did some experiments:
Presto/Trino simply doesn't support comparison with nested nulls, instead returning a not supported error.
BigQuery doesn't support comparison of nested types at all.
Spark appears to follow the postgres behaviour.
In which case I think I agree that we can just follow postgres and define a total order, I will polish up my implementation of Compare.
Edit: In fact we can possibly use the existing comparator, but pass in the null ordering 🤔
Edit edit: unfortunately spark and postgres use differing null orderings by default, which means this needs to be configurable |
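A rough illustration of a total order with a configurable null ordering, written against plain `Option`/`Vec` values rather than Arrow arrays. The names `cmp_item`, `cmp_list` and the `nulls_first` flag are assumptions made for this sketch and are not taken from the eventual implementation.

```rust
use std::cmp::Ordering;

type Item = Option<i32>;
type List = Option<Vec<Item>>;

/// Where a null sorts relative to a non-null value.
fn null_vs_value(nulls_first: bool) -> Ordering {
    if nulls_first {
        Ordering::Less
    } else {
        Ordering::Greater
    }
}

/// Total order over nullable scalars: two nulls compare equal.
fn cmp_item(a: Item, b: Item, nulls_first: bool) -> Ordering {
    match (a, b) {
        (None, None) => Ordering::Equal,
        (None, Some(_)) => null_vs_value(nulls_first),
        (Some(_), None) => null_vs_value(nulls_first).reverse(),
        (Some(x), Some(y)) => x.cmp(&y),
    }
}

/// Total order over nullable lists: element-wise, first difference wins,
/// and a list that is a strict prefix of the other sorts first.
fn cmp_list(a: &List, b: &List, nulls_first: bool) -> Ordering {
    match (a, b) {
        (None, None) => Ordering::Equal,
        (None, Some(_)) => null_vs_value(nulls_first),
        (Some(_), None) => null_vs_value(nulls_first).reverse(),
        (Some(x), Some(y)) => {
            for (l, r) in x.iter().zip(y.iter()) {
                let ord = cmp_item(*l, *r, nulls_first);
                if ord != Ordering::Equal {
                    return ord;
                }
            }
            x.len().cmp(&y.len())
        }
    }
}
```

With the lists from the earlier comment, `[1, NULL, 2]` sorts before `[1, 1, 2]` when `nulls_first` is true and after it when `nulls_first` is false, so the flag alone covers both default conventions.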
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Currently build_compare can be used to construct a DynComparator that can be used to establish an ordering between two different array indices. It is up to the caller to inspect the null masks, and apply ordering as appropriate. This works well for non-nested arrays, but runs into issues when arrays have children.
The introduction of logical nullability worked around this limitation for DictionaryArray and RunArray, but it is unclear how best to handle types such as StructArray and ListArray.
Part of the challenge is there isn't one way users may wish to handle nulls:
SortOptions::nulls_first parameter
Also, correctly using DynComparator requires understanding the distinction between logical and physical nullability, see #4691, which is confusing (#4840).
Describe the solution you'd like
In order to support arbitrarily nested ListArray in comparison kernels, we need a mechanism similar to DynComparator but which is able to handle nulls.
I would like something similar to the below
We could then deprecate and eventually remove build_compare and DynComparator.
This would not only allow supporting nested types in the comparison kernels, but would also provide a coherent story for handling these types.
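As a rough illustration of what "similar to DynComparator but able to handle nulls" could look like, the std-only sketch below builds an index-based closure that consults nullness itself, so the caller never inspects a null mask. `NullAwareComparator` and `make_null_aware_comparator` are hypothetical names, and `Vec<Option<i32>>` is only a stand-in for an Arrow array; this is not the API proposed in the issue.

```rust
use std::cmp::Ordering;

/// Hypothetical comparator type: compares left[i] with right[j],
/// with null handling already baked in.
type NullAwareComparator = Box<dyn Fn(usize, usize) -> Ordering + Send + Sync>;

/// Builds a comparator over two nullable columns.
/// `nulls_first` decides whether a null sorts before or after any value.
fn make_null_aware_comparator(
    left: Vec<Option<i32>>,
    right: Vec<Option<i32>>,
    nulls_first: bool,
) -> NullAwareComparator {
    // Ordering of a null on the left against a value on the right.
    let null_vs_value = if nulls_first {
        Ordering::Less
    } else {
        Ordering::Greater
    };
    Box::new(move |i, j| match (left[i], right[j]) {
        (None, None) => Ordering::Equal,
        (None, Some(_)) => null_vs_value,
        (Some(_), None) => null_vs_value.reverse(),
        (Some(a), Some(b)) => a.cmp(&b),
    })
}
```

A comparator for a nested type could then wrap a child comparator built the same way, which is what lets the null handling compose through arbitrary nesting.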
Describe alternatives you've considered
Additional context