Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement lists::index_of() to find positions in list rows #9510

Merged
merged 20 commits into from
Dec 20, 2021

Conversation

mythrocks
Copy link
Contributor

@mythrocks mythrocks commented Oct 23, 2021

Fixes #9164.

Prelude

lists::contains() (introduced in #7039) returns a BOOL8 column, indicating whether the specified search_key(s) exist at all in each corresponding list row of an input LIST column. It does not return the actual position.

index_of()

This commit introduces lists::index_of(), to return the INT32 positions of the specified search_key(s) in a LIST column.

The search keys may be searched for using either FIND_FIRST (which finds the position of the first occurrence), or FIND_LAST (which finds the last occurrence). Both column_view and scalar search keys are supported.

As with lists::contains(), nested types are not supported as search keys in lists::index_of().

If the search_key cannot be found, that output row is set to -1. Additionally, the row output[i] is set to null if:

  1. The search_key(scalar) or search_keys[i](column_view) is null.
  2. The list row lists[i] is null

In all other cases, output[i] should contain a non-negative value.

Semantic changes for lists::contains()

This commit also modifies the semantics of lists::contains(): it will now return nulls only for the following cases:

  1. The search_key(scalar) or search_keys[i](column_view) is null.
  2. The list row lists[i] is null

In all other cases, a non-null bool is returned. Specifically lists::contains() no longer conforms to SQL semantics of returning NULL for list rows that don't contain the search key, while simultaneously containing nulls. In this case, false is returned.

lists::contains_null_elements()

A new function has been introduced to check if each list row contains null elements. The semantics are similar to lists::contains(), in that the column returned is BOOL8 typed:

  1. If even 1 element in a list row is null, the returned row is true.
  2. If no element is null, the returned row is false.
  3. If the list row is null, the returned row is null.
  4. If the list row is empty, the returned row is false.

The current implementation is an inefficient placeholder, to be replaced once (#9588) is available. It is included here to reconstruct the SQL semantics dropped from lists::contains().

@mythrocks mythrocks requested a review from a team as a code owner October 23, 2021 00:32
@mythrocks mythrocks self-assigned this Oct 23, 2021
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Oct 23, 2021
@mythrocks mythrocks added 3 - Ready for Review Ready for review by team feature request New feature or request non-breaking Non-breaking change labels Oct 23, 2021
@mythrocks
Copy link
Contributor Author

mythrocks commented Oct 23, 2021

index_of() and contains() share a backend implementation. There is a difference in semantics between the two in the following case:

  1. The list row does not contain the search key. And...
  2. The list row also contains null values

For any list row where both the above hold true:

  1. contains() returns null for that row.
  2. index_of() returns -1, indicating that the row wasn't found.

Care needed to be taken to support both cases in the same code.

@codecov
Copy link

codecov bot commented Oct 23, 2021

Codecov Report

Merging #9510 (a5beb0f) into branch-22.02 (967a333) will decrease coverage by 0.08%.
The diff coverage is n/a.

Impacted file tree graph

@@               Coverage Diff                @@
##           branch-22.02    #9510      +/-   ##
================================================
- Coverage         10.49%   10.40%   -0.09%     
================================================
  Files               119      119              
  Lines             20305    20507     +202     
================================================
+ Hits               2130     2134       +4     
- Misses            18175    18373     +198     
Impacted Files Coverage Δ
python/dask_cudf/dask_cudf/sorting.py 92.30% <0.00%> (-0.61%) ⬇️
python/cudf/cudf/__init__.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/frame.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/index.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/parquet.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/series.py 0.00% <0.00%> (ø)
python/cudf/cudf/utils/utils.py 0.00% <0.00%> (ø)
python/cudf/cudf/utils/dtypes.py 0.00% <0.00%> (ø)
python/cudf/cudf/utils/ioutils.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/dataframe.py 0.00% <0.00%> (ø)
... and 14 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8c5a85a...a5beb0f. Read the comment docs.

@jrhemstad
Copy link
Contributor

If the search_key cannot be found, that output row is set to -1.

Why is that? Shouldn't it be made null?

@mythrocks
Copy link
Contributor Author

mythrocks commented Oct 25, 2021

Why is that? Shouldn't it be made null?

This is so that we are able to support the SQL semantics of array_index() in Spark, (Teradata, and possibly others) which make the following distinction:

  1. If list row is null, or the search key is null, return null.
  2. If neither is null, return a non-null index indicating "valid, but not found".

It would be difficult disambiguate -1 from null if both conditions were to return null. Replacing -1 with null if necessary as an after step seemed easier.

The SQL semantics are annoying in various flavours:

  1. The array_index() indices are 1-based. spark-rapids is going to have to add 1 to the result of lists::index_of().
  2. array_contains() does not follow the semantics of array_index(), as described here.

@mythrocks
Copy link
Contributor Author

The end goal includes being able to use lists::index_of() for key lookups in maps (i.e. LIST<STRUCT<key, value>>.
I hope to do this with:

auto indices = lists::index_of(key_list, key_scalar, FIND_LAST);
auto values  = lists::extract_list_element(value_list, *indices);

The snag is that indices will need to be preprocessed to replace nulls and -1 with MAXINT.

@jrhemstad
Copy link
Contributor

There is a difference in semantics between the two in the following case:

  1. The list row does not contain the search key. And...
  2. The list row also contains null values

For any list row where both the above hold true:

  1. contains() returns null for that row.
  2. index_of() returns -1, indicating that the row wasn't found.

To be clear, is this the case when the list contains one or more null values? Or the list contains all null values?

In other words,

key: 5
list0 : [1, 2, 3, 4]
list1: [1, 2, NULL, 4]
list2: [ NULL, NULL]

What does contains return for these lists?

@mythrocks
Copy link
Contributor Author

What does contains return for these lists?

Apologies for not making this clearer.
The described semantics are for when even one null is found in a LIST row. Ergo:

auto input             = [ [1,2,3,4], [1,2,NULL,4], [NULL,NULL], [1,NULL,5], [],    NULL ];
contains(input, 5)    == [   false,      NULL,       NULL,         true,     false, NULL ];
index_of(input, 5)    == [   -1,          -1,          -1,            2,     -1,    NULL ];

@revans2 will keep me honest here, in how this is congruent with the semantics of the equivalent SQL.

@jrhemstad
Copy link
Contributor

Wow, these semantics are completely bonkers. There's no way we should repeat this behavior in C++.

I understand the need to differentiate between "key not found" and "list is null", but the behavior described is horribly non-intuitive and inconsistent.

Here's what I suggest the behavior should be:

auto input             = [ [1,2,3,4], [1,2,NULL,4], [NULL,NULL], [1,NULL,5], [],    NULL ];
contains(input, 5)    == [   false,      false,       false,         true,     false, NULL ];
index_of(input, 5)    == [   -1,          -1,          -1,            2,     -1,    NULL ];

The previously described null behavior of contains can be emulated by a contains_nulls API for lists that returns true/false if a list contains nulls. This result can be combined with the results from contains to give the desired output.

@mythrocks
Copy link
Contributor Author

can be emulated by a contains_nulls API...

Hmm. Now that you point it out, it does sound logical to break it up this way. We can maintain the same semantics between contains() and index_of() (and contains_null). That makes sense.

@mythrocks mythrocks marked this pull request as draft October 26, 2021 00:08
@mythrocks
Copy link
Contributor Author

mythrocks commented Oct 28, 2021

The previously described null behavior of contains can be emulated...

I should call out that lists::contains() already has this null behaviour. spark-rapids depends on this, starting from NVIDIA/spark-rapids/pull/1565. Alteration of this behaviour would be a breaking change.

@jrhemstad
Copy link
Contributor

I should call out that lists::contains() already has this null behaviour.

Had I noticed it originally, I wouldn't have let it be merged in the first place ;)

Alteration of this behaviour would be a breaking change.

Break away! Just use the right label.

@mythrocks
Copy link
Contributor Author

I'm taking this PR out of 21.12, given that it'll break compat. I should have this early in the next release.

`lists::contains()` (introduced in rapidsai#7039) returns a `BOOL8` column,
indicating whether the specified search_key(s) exist at all in each
corresponding list row of an input LIST column. It does not return
the actual position.

This commit introduces `lists::index_of()`, to return the INT32
positions of the specified search_key(s) in a LIST column.

The search keys may be searched for using either `FIND_FIRST`
(which finds the position of the first occurrence), or `FIND_LAST`
(which finds the last occurrence). Both column_view and scalar
search keys are supported.

As with `lists::contains()`, nested types are not supported as
search keys is `lists::index_of()`.

If the search_key cannot be found, that output row is set to `-1`.
Additionally, the row `output[i]` is set to null if:
  1. The search_key(scalar) or search_keys[i](column_view) is null.
  2. The list row `lists[i]` is null
In all other cases, `output[i]` should contain a non-negative value.
Signed-off-by: MithunR <mythrocks@gmail.com>
@github-actions github-actions bot added the CMake CMake build issue label Nov 23, 2021
@mythrocks mythrocks requested a review from harrism December 9, 2021 08:50
@mythrocks
Copy link
Contributor Author

Rerun tests

@mythrocks
Copy link
Contributor Author

#9870 seems to have fixed the cmake-format failures. It should now be possible to rerun tests.

1. `const` all the things.
2. Overload function names.
@mythrocks mythrocks requested a review from jlowe December 13, 2021 23:22
@mythrocks mythrocks added the 5 - DO NOT MERGE Hold off on merging; see PR for details label Dec 14, 2021
@mythrocks
Copy link
Contributor Author

Applied DO NOT MERGE label. Currently, merging this would break spark-rapids.

I'm working on an additional JNI change, to allow setting a column's null-mask to the result of a boolean operation.

@mythrocks
Copy link
Contributor Author

mythrocks commented Dec 16, 2021

I'm working on an additional JNI change, to allow setting a column's null-mask to the result of a boolean operation.

Filed here.

Removed DO NOT MERGE label. The spark-rapids plugin code change to accept this breakage is ready.

@codereport, I wonder if you might have a moment to review.

Copy link
Contributor

@codereport codereport left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code looks really good 🔥 Very "expression oriented" and easy to read! Only comments are aesthetic

cpp/include/cudf/lists/contains.hpp Outdated Show resolved Hide resolved
cpp/src/lists/contains.cu Outdated Show resolved Hide resolved
@mythrocks mythrocks added the 5 - Ready to Merge Testing and reviews complete, ready to merge label Dec 17, 2021
@mythrocks
Copy link
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit a4dc42d into rapidsai:branch-22.02 Dec 20, 2021
@mythrocks
Copy link
Contributor Author

I've merged this change now.
Thank you all for the reviews.

mythrocks added a commit to NVIDIA/spark-rapids that referenced this pull request Dec 20, 2021
* Accommodate altered semantics of `cudf::lists::contains()`

rapidsai/cudf/pull/9510 changes the semantics of lists::contains(), with regard to rows containing nulls. Specifically, if a list row contains at least one null element, and is found not to contain the search key, libcudf will now return false instead of null.
SparkSQL expects to return null in those cases.

This commit accommodates the change in libcudf's semantics, to keep its own existing behaviour.

Signed-off-by: MithunR <mythrocks@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team 5 - Ready to Merge Testing and reviews complete, ready to merge breaking Breaking change feature request New feature or request Java Affects Java cuDF API. libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] index_in_list() to return the index of a search-key in a list row
7 participants