Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix gather() for STRUCT inputs with no nulls in members. #9194

Merged

Conversation

mythrocks
Copy link
Contributor

Fixes #9188.

Prior to this fix, when cudf::gather() is called on a STRUCT input
column, the null masks of the children of the result column would not
be set correctly if the child columns do not contain nulls.

This fix enforces null mask calculation if NULLIFY is set.

In addition, this commit also cleans up the TypedStructGatherTest test suite:

  1. IIFEs for STRUCT column construction.
  2. Corrections for assert() conditions.
  3. Switched column construction to column_wrapper and column_view.
  4. Corrected the output null mask for TestGatherStructOfListOfStructs.
  5. Added repro test for [BUG] gathering structs can produce valid child values underneath null struct parent #9188.

Fixes rapidsai#9188.

Prior to this fix, when `cudf::gather()` is called on a STRUCT input
column, the null masks of the children of the result column would not
be set correctly if the child columns do not contain nulls.

This fix enforces null mask calculation if NULLIFY is set.
@mythrocks mythrocks requested a review from a team as a code owner September 8, 2021 06:53
@mythrocks mythrocks self-assigned this Sep 8, 2021
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Sep 8, 2021
@mythrocks mythrocks added bug Something isn't working non-breaking Non-breaking change labels Sep 8, 2021
@mythrocks
Copy link
Contributor Author

#9188 was obscured inadvertently in the tests. The gather map used in the tests looked like:

  auto const gather_map       = gather_map_t{-1, 4, 3, 2, 1, 7, 3};

The intention was to cause the gathered output to have a null row at index 0. This test merely caused the last row of the input column to be gathered to index 0.
It so happened that the last row in the input column was null, much to my chagrin.

Copy link
Member

@jlowe jlowe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verified this fixes the failing Spark tests that prompted the filing of #9188. Thanks, @mythrocks!

@mythrocks
Copy link
Contributor Author

Verified this fixes the failing Spark tests that prompted the filing of #9188. Thanks, @mythrocks!

Credit where it's due: Thank you for the fix, @jlowe. I was convinced the problem had more to do with not calling superimpose_parent_nulls(), from elsewhere.

@mythrocks
Copy link
Contributor Author

The Python test failures look unrelated to this change:

Test Result (305 failures / +305)
cudf.tests.test_concat.test_concat_empty_dataframes[True-other2-df3]
cudf.tests.test_concat.test_concat_empty_dataframes[True-other3-df0]
cudf.tests.test_concat.test_concat_empty_dataframes[True-other3-df6]
cudf.tests.test_concat.test_concat_empty_dataframes[False-other2-df3]
cudf.tests.test_concat.test_concat_empty_dataframes[True-other1-df8]
cudf.tests.test_concat.test_concat_empty_dataframes[True-other2-df4]
cudf.tests.test_concat.test_concat_empty_dataframes[False-other3-df0]
cudf.tests.test_concat.test_concat_empty_dataframes[True-other2-df0]
cudf.tests.test_concat.test_concat_empty_dataframes[True-other3-df1]
cudf.tests.test_concat.test_concat_empty_dataframes[True-other2-df2]
...

@ttnghia
Copy link
Contributor

ttnghia commented Sep 8, 2021

Rerun tests.

@codecov
Copy link

codecov bot commented Sep 8, 2021

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.10@c6ddd46). Click here to learn what that means.
The diff coverage is n/a.

Impacted file tree graph

@@               Coverage Diff               @@
##             branch-21.10    #9194   +/-   ##
===============================================
  Coverage                ?   10.82%           
===============================================
  Files                   ?      115           
  Lines                   ?    19166           
  Branches                ?        0           
===============================================
  Hits                    ?     2074           
  Misses                  ?    17092           
  Partials                ?        0           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c6ddd46...23f52e5. Read the comment docs.

@mythrocks
Copy link
Contributor Author

Rerun tests

1 similar comment
@mythrocks
Copy link
Contributor Author

Rerun tests

@ttnghia
Copy link
Contributor

ttnghia commented Sep 11, 2021

Rerun tests.

@ttnghia
Copy link
Contributor

ttnghia commented Sep 13, 2021

Rerun tests.

@mythrocks
Copy link
Contributor Author

mythrocks commented Sep 13, 2021

I've rebased to latest on branch-21.10. Hopefully this gets past the erstwhile local-thrust build problem. Rerunning tests now.

Copy link
Contributor

@karthikeyann karthikeyann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.
just mild suggestions on style.

cpp/tests/copying/gather_struct_tests.cpp Outdated Show resolved Hide resolved
cpp/include/cudf/detail/gather.cuh Outdated Show resolved Hide resolved
cpp/tests/copying/gather_struct_tests.cpp Show resolved Hide resolved
@mythrocks
Copy link
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit eab2486 into rapidsai:branch-21.10 Sep 13, 2021
@mythrocks mythrocks deleted the fix-structs-gather-nullify branch September 13, 2021 22:27
rapids-bot bot pushed a commit that referenced this pull request Sep 17, 2021
This PR fixes the `gather` API for structs columns when the input is a sliced column. Previously, `gather` calls `child_begin()` and `child_end()` to access the children column so if the input structs column is sliced then the output is incorrect.

This closes #9213, and is blocked by #9194 due to conflict work.

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - MithunR (https://github.com/mythrocks)
  - Mark Harris (https://github.com/harrism)

URL: #9218
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] gathering structs can produce valid child values underneath null struct parent
4 participants