Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Purge nonempty nulls from byte_cast list outputs. #11971

Merged
merged 25 commits into from
Apr 13, 2023

Conversation

bdice
Copy link
Contributor

@bdice bdice commented Oct 21, 2022

Description

Resolves #11754. The byte_cast function is creating unsanitized lists from null inputs, which is a bug. This logic copies nonzero bytes even if the input element is null. The input's null mask is copied onto the output parent list column, but the null children are nonempty. This PR fixes the bug by calling cudf::purge_nonempty_nulls on the result before returning, if there are any nulls to be purged.

Depends on:

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@bdice bdice added bug Something isn't working non-breaking Non-breaking change labels Oct 21, 2022
@bdice bdice self-assigned this Oct 21, 2022
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Oct 21, 2022
@codecov
Copy link

codecov bot commented Oct 28, 2022

Codecov Report

Patch coverage has no change and project coverage change: +2.68 🎉

Comparison is base (ed9385b) 85.47% compared to head (d18c557) 88.16%.

❗ Current head d18c557 differs from pull request most recent head 92c9dfb. Consider uploading reports for the commit 92c9dfb to get more accurate results

Additional details and impacted files
@@               Coverage Diff                @@
##           branch-23.06   #11971      +/-   ##
================================================
+ Coverage         85.47%   88.16%   +2.68%     
================================================
  Files               152      133      -19     
  Lines             24650    21977    -2673     
================================================
- Hits              21069    19375    -1694     
+ Misses             3581     2602     -979     

see 107 files with indirect coverage changes

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

@bdice bdice marked this pull request as ready for review October 28, 2022 00:44
@bdice bdice requested a review from a team as a code owner October 28, 2022 00:44
@ttnghia
Copy link
Contributor

ttnghia commented Oct 28, 2022

This is fixing for lists. What about strings?

@hyperbolic2346
Copy link
Contributor

Would it be better to not copy the null data to begin with instead of making a new pass to remove the data that was copied? Is there a reason that wouldn't work?

@bdice
Copy link
Contributor Author

bdice commented Oct 28, 2022

This is fixing for lists. What about strings?

Very good catch @ttnghia. I'll address this soon.

@bdice bdice changed the base branch from branch-22.12 to branch-23.02 November 17, 2022 17:21
@ttnghia
Copy link
Contributor

ttnghia commented Dec 22, 2022

Would it be better to not copy the null data to begin with instead of making a new pass to remove the data that was copied? Is there a reason that wouldn't work?

We need to know the exact target location to copy to. If we want to generate empty lists directly,we have to recompute offsets before copying. I believe that it is not cheaper while more complicated.

hyperbolic2346
hyperbolic2346 previously approved these changes Dec 30, 2022
@bdice bdice changed the base branch from branch-23.02 to branch-23.04 January 26, 2023 16:59
@bdice bdice dismissed hyperbolic2346’s stale review January 26, 2023 16:59

The base branch was changed.

@ttnghia ttnghia added 3 - Ready for Review Ready for review by team and removed conda labels Apr 8, 2023
@ttnghia ttnghia marked this pull request as ready for review April 8, 2023 02:00
Copy link
Contributor

@hyperbolic2346 hyperbolic2346 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general this is looking good to me. Just a couple of questions and suggestions.

cpp/src/reshape/byte_cast.cu Outdated Show resolved Hide resolved
cpp/src/reshape/byte_cast.cu Outdated Show resolved Hide resolved
}

template <typename T>
struct byte_list_conversion_fn<
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this one be up with the other and not straddling the dispatcher?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry we still have to dispatch, since there are many types in cudf::is_numeric.

cpp/src/reshape/byte_cast.cu Outdated Show resolved Hide resolved
cpp/src/reshape/byte_cast.cu Outdated Show resolved Hide resolved
Copy link
Contributor

@davidwendt davidwendt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.
Can there be tests for an empty input column too?
And perhaps tests for any throw conditions as well? Or maybe I missed those.

cpp/tests/reshape/byte_cast_tests.cpp Outdated Show resolved Hide resolved
ttnghia and others added 3 commits April 9, 2023 18:01
@ttnghia
Copy link
Contributor

ttnghia commented Apr 10, 2023

Can there be tests for an empty input column too?

Thanks. Having the suggested tests, I discovered bug(s) when the input is empty/all nulls. Now they're fixed.

@ttnghia
Copy link
Contributor

ttnghia commented Apr 10, 2023

Since this also uses lists::make_empty_lists_column, it depends on:

Copy link
Contributor

@davidwendt davidwendt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love that we had tests explicitly noting that we had unsanitized data and working around it.

One non-blocking note about east vs west constexpr, otherwise good to ship from me.

rmm::mr::device_memory_resource*) const

// Data type of the output data column after conversion.
data_type constexpr output_type{data_type{type_id::UINT8}};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't really use "east constexpr" anywhere in our codebase. Many of the reasons that east const makes sense don't really make sense for constexpr, particularly when it comes to the confusion around pointers. For instance data_type constexpr* output_type does not in fact mean "pointer to constant data_type", it means "constant pointer to data_type". For instance, check out this example: https://godbolt.org/z/o4YaEhrsE.

cpp/src/reshape/byte_cast.cu Outdated Show resolved Hide resolved
auto chars_contents = col_content.children[strings_column_view::chars_column_index]->release();
auto const num_chars = chars_contents.data->size();
auto uint8_col = std::make_unique<column>(
output_type, num_chars, std::move(*(chars_contents.data)), rmm::device_buffer{}, 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice this section is cleaner now without the old null mask clutter.

@ttnghia
Copy link
Contributor

ttnghia commented Apr 13, 2023

/merge

@rapids-bot rapids-bot bot merged commit 4b34831 into rapidsai:branch-23.06 Apr 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] cudf::byte_cast can return non-empty null values.
6 participants