Implement lists::index_of() to find positions in list rows #9510
For any list row where both the above hold true:
Care needed to be taken to support both cases in the same code. |
Codecov Report
@@ Coverage Diff @@
## branch-22.02 #9510 +/- ##
================================================
- Coverage 10.49% 10.40% -0.09%
================================================
Files 119 119
Lines 20305 20507 +202
================================================
+ Hits 2130 2134 +4
- Misses 18175 18373 +198
Continue to review full report at Codecov.
|
Why is that? Shouldn't it be made null? |
This is so that we are able to support the SQL semantics of
It would be difficult to disambiguate The SQL semantics are annoying in various flavours:
|
The end goal includes being able to use:
    auto indices = lists::index_of(key_list, key_scalar, FIND_LAST);
    auto values = lists::extract_list_element(value_list, *indices);
The snag is that |
To be clear, is this the case when the list contains one or more null values? Or the list contains all null values? In other words,
What does |
Apologies for not making this clearer.
    auto input = [ [1,2,3,4], [1,2,NULL,4], [NULL,NULL], [1,NULL,5], [], NULL ];
    contains(input, 5) == [ false, NULL, NULL, true, false, NULL ];
    index_of(input, 5) == [ -1, -1, -1, 2, -1, NULL ];
@revans2 will keep me honest here, in how this is congruent with the semantics of the equivalent SQL. |
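The expected outputs in the example above can be reproduced with a small host-side model. This is only a sketch: `Element`, `Row`, `example_input`, `sql_contains` and `index_of` are hypothetical stand-ins built on `std::optional`, not libcudf APIs, and the null handling in `sql_contains` is inferred from the outputs listed in that comment rather than taken from any implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical model: a list row may be null, and so may each element.
using Element = std::optional<int32_t>;
using Row     = std::optional<std::vector<Element>>;

// The six rows from the comment: [1,2,3,4], [1,2,NULL,4], [NULL,NULL], [1,NULL,5], [], NULL
std::vector<Row> example_input() {
  const Element null_elem = std::nullopt;
  return {std::vector<Element>{1, 2, 3, 4},
          std::vector<Element>{1, 2, null_elem, 4},
          std::vector<Element>{null_elem, null_elem},
          std::vector<Element>{1, null_elem, 5},
          std::vector<Element>{},
          std::nullopt};
}

// SQL-style contains, as implied by the example: true if the key is found;
// otherwise null if the row is null or holds any null element; otherwise false.
std::optional<bool> sql_contains(const Row& row, int32_t key) {
  if (!row) return std::nullopt;
  bool saw_null = false;
  for (const Element& e : *row) {
    if (!e) { saw_null = true; continue; }
    if (*e == key) return true;
  }
  if (saw_null) return std::nullopt;
  return false;
}

// Position of the first match; -1 if the key is absent; null only for a null row.
std::optional<int32_t> index_of(const Row& row, int32_t key) {
  if (!row) return std::nullopt;
  for (std::size_t i = 0; i < row->size(); ++i) {
    const Element& e = (*row)[i];
    if (e && *e == key) return static_cast<int32_t>(i);
  }
  return -1;
}
```

In this model, `-1` and null stay distinguishable in the `index_of` output, whereas `sql_contains` returns null both for a null row and for "not found, but the row holds a null".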
Wow, these semantics are completely bonkers. There's no way we should repeat this behavior in C++. I understand the need to differentiate between "key not found" and "list is null", but the behavior described is horribly non-intuitive and inconsistent. Here's what I suggest the behavior should be:
The previously described null behavior of |
Hmm. Now that you point it out, it does sound logical to break it up this way. We can maintain the same semantics between |
I should call out that |
Had I noticed it originally, I wouldn't have let it be merged in the first place ;)
Break away! Just use the right label. |
I'm taking this PR out of 21.12, given that it'll break compat. I should have this early in the next release. |
`lists::contains()` (introduced in rapidsai#7039) returns a `BOOL8` column, indicating whether the specified search_key(s) exist at all in each corresponding list row of an input LIST column. It does not return the actual position.
This commit introduces `lists::index_of()`, to return the INT32 positions of the specified search_key(s) in a LIST column.
The search keys may be searched for using either `FIND_FIRST` (which finds the position of the first occurrence), or `FIND_LAST` (which finds the last occurrence). Both column_view and scalar search keys are supported.
As with `lists::contains()`, nested types are not supported as search keys in `lists::index_of()`.
If the search_key cannot be found, that output row is set to `-1`. Additionally, the row `output[i]` is set to null if:
1. The search_key (scalar) or search_keys[i] (column_view) is null.
2. The list row `lists[i]` is null.
In all other cases, `output[i]` should contain a non-negative value.
Signed-off-by: MithunR <mythrocks@gmail.com>
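The rules in the commit message above can be sketched on the host as follows. This is an illustrative model, not the libcudf implementation: `Element`, `Row`, `FindOption`, `index_of_row`, `sample_lists` and both `index_of` overloads are hypothetical names, with `std::optional` standing in for nullable rows, elements and keys.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

using Element = std::optional<int32_t>;               // nullable element or key
using Row     = std::optional<std::vector<Element>>;  // nullable list row

// Hypothetical enum mirroring the FIND_FIRST / FIND_LAST options in the commit message.
enum class FindOption { FIND_FIRST, FIND_LAST };

// One output row: null if the key or the list row is null, -1 if the key is
// absent, otherwise the zero-based position of the first or last match.
std::optional<int32_t> index_of_row(const Row& row, const Element& key, FindOption opt) {
  if (!row || !key) return std::nullopt;
  int32_t found = -1;
  for (std::size_t i = 0; i < row->size(); ++i) {
    const Element& e = (*row)[i];
    if (e && *e == *key) {
      found = static_cast<int32_t>(i);
      if (opt == FindOption::FIND_FIRST) break;  // FIND_LAST keeps scanning
    }
  }
  return found;
}

// Scalar key: the same key is searched for in every row.
std::vector<std::optional<int32_t>> index_of(const std::vector<Row>& lists,
                                             const Element& key, FindOption opt) {
  std::vector<std::optional<int32_t>> out;
  out.reserve(lists.size());
  for (const Row& row : lists) out.push_back(index_of_row(row, key, opt));
  return out;
}

// Column of keys: keys[i] is searched for in lists[i].
std::vector<std::optional<int32_t>> index_of(const std::vector<Row>& lists,
                                             const std::vector<Element>& keys, FindOption opt) {
  assert(lists.size() == keys.size());
  std::vector<std::optional<int32_t>> out;
  out.reserve(lists.size());
  for (std::size_t i = 0; i < lists.size(); ++i)
    out.push_back(index_of_row(lists[i], keys[i], opt));
  return out;
}

// Sample rows for experimentation: [1,2,1], [NULL,2], [], NULL
std::vector<Row> sample_lists() {
  const Element null_elem = std::nullopt;
  return {std::vector<Element>{1, 2, 1}, std::vector<Element>{null_elem, 2},
          std::vector<Element>{}, std::nullopt};
}
```

With `sample_lists()` and the scalar key `1`, `FIND_FIRST` and `FIND_LAST` differ only on the first row, which holds the key twice; a null key makes every output row null.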
Rerun tests |
#9870 seems to have fixed the |
1. `const` all the things. 2. Overload function names.
Applied I'm working on an additional JNI change, to allow setting a column's null-mask to the result of a boolean operation. |
Filed here. Removed @codereport, I wonder if you might have a moment to review. |
This code looks really good 🔥 Very "expression oriented" and easy to read! Only comments are aesthetic
@gpucibot merge |
I've merged this change now. |
* Accommodate altered semantics of `cudf::lists::contains()`
rapidsai/cudf/pull/9510 changes the semantics of lists::contains(), with regard to rows containing nulls. Specifically, if a list row contains at least one null element, and is found not to contain the search key, libcudf will now return false instead of null.
SparkSQL expects to return null in those cases. This commit accommodates the change in libcudf's semantics, to keep its own existing behaviour.
Signed-off-by: MithunR <mythrocks@gmail.com>
Fixes #9164.
Prelude
`lists::contains()` (introduced in #7039) returns a `BOOL8` column, indicating whether the specified search_key(s) exist at all in each corresponding list row of an input LIST column. It does not return the actual position.
`index_of()`
This commit introduces `lists::index_of()`, to return the INT32 positions of the specified search_key(s) in a LIST column.
The search keys may be searched for using either `FIND_FIRST` (which finds the position of the first occurrence), or `FIND_LAST` (which finds the last occurrence). Both column_view and scalar search keys are supported.
As with `lists::contains()`, nested types are not supported as search keys in `lists::index_of()`.
If the search_key cannot be found, that output row is set to `-1`. Additionally, the row `output[i]` is set to null if:
1. The `search_key` (scalar) or `search_keys[i]` (column_view) is null.
2. The list row `lists[i]` is null.
In all other cases, `output[i]` should contain a non-negative value.
Semantic changes for `lists::contains()`
This commit also modifies the semantics of `lists::contains()`: it will now return nulls only for the following cases:
1. The `search_key` (scalar) or `search_keys[i]` (column_view) is null.
2. The list row `lists[i]` is null.
In all other cases, a non-null bool is returned. Specifically, `lists::contains()` no longer conforms to SQL semantics of returning `NULL` for list rows that don't contain the search key, while simultaneously containing nulls. In this case, `false` is returned.
`lists::contains_null_elements()`
A new function has been introduced to check if each list row contains null elements. The semantics are similar to `lists::contains()`, in that the column returned is BOOL8 typed:
1. `true`.
2. `false`.
3. `null`.
4. `false`.
The current implementation is an inefficient placeholder, to be replaced once (#9588) is available. It is included here to reconstruct the SQL semantics dropped from `lists::contains()`.
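As a rough illustration of how the SQL semantics might be reconstructed from the two functions, here is a host-side sketch. Everything in it is hypothetical (`Element`, `Row`, and the three free functions are models, not libcudf APIs), and the mapping used for `contains_null_elements` (null for a null row, `true` if any element is null, `false` otherwise) is an assumption of this sketch, since the conditions attached to each value are not spelled out above. Null search keys are not modelled.

```cpp
#include <algorithm>
#include <cstdint>
#include <optional>
#include <vector>

using Element = std::optional<int32_t>;               // nullable element
using Row     = std::optional<std::vector<Element>>;  // nullable list row

// Revised contains(): null only for a null row, otherwise a plain bool.
std::optional<bool> contains(const Row& row, int32_t key) {
  if (!row) return std::nullopt;
  return std::any_of(row->begin(), row->end(),
                     [key](const Element& e) { return e && *e == key; });
}

// ASSUMED semantics: null for a null row, true if any element is null, else false.
std::optional<bool> contains_null_elements(const Row& row) {
  if (!row) return std::nullopt;
  return std::any_of(row->begin(), row->end(),
                     [](const Element& e) { return !e.has_value(); });
}

// SQL-style result rebuilt from the two columns above: a "not found" answer
// becomes null when the row also holds a null element.
std::optional<bool> sql_contains(const Row& row, int32_t key) {
  const std::optional<bool> found     = contains(row, key);
  const std::optional<bool> has_nulls = contains_null_elements(row);
  if (!found || !has_nulls) return std::nullopt;   // null list row
  if (!*found && *has_nulls) return std::nullopt;  // key absent, but nulls present
  return found;
}
```

For a row such as `[1, 2, NULL, 4]` and key `5`, the modelled `contains` yields `false` while `sql_contains` yields null, which is the difference a SQL-facing caller would need to restore.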