Purge nonempty nulls from byte_cast list outputs. #11971
Codecov Report
Patch coverage has no change and project coverage change: +2.68%
Additional details and impacted files
@@            Coverage Diff             @@
##        branch-23.06   #11971   +/-   ##
================================================
+ Coverage     85.47%   88.16%   +2.68%
================================================
  Files           152      133      -19
  Lines         24650    21977    -2673
================================================
- Hits          21069    19375    -1694
+ Misses         3581     2602     -979
see 107 files with indirect coverage changes
This fixes the problem for lists. What about strings?
Would it be better to not copy the null data to begin with instead of making a new pass to remove the data that was copied? Is there a reason that wouldn't work?
We need to know the exact target location to copy to. If we want to generate empty lists directly, we have to recompute the offsets before copying. I believe that would be no cheaper and would be more complicated.
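To make the trade-off above concrete, here is a minimal host-side sketch in plain C++. It is not the cudf kernel, and the function names are hypothetical. When every row is copied, a row's target location is simply the row index times the element width. When null rows must become empty lists, each target location depends on how many valid rows precede it, so a scan over the validity is needed before any bytes can be copied.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: offsets when every row (null or not) contributes
// `width` bytes. The target location of row i is known without any scan.
std::vector<std::int32_t> dense_offsets(std::size_t num_rows, std::int32_t width)
{
  std::vector<std::int32_t> offsets(num_rows + 1);
  for (std::size_t i = 0; i <= num_rows; ++i) {
    offsets[i] = static_cast<std::int32_t>(i) * width;
  }
  return offsets;
}

// Hypothetical helper: offsets when null rows must be empty lists.
// Each offset is a running sum over the validity of all earlier rows,
// so it has to be computed before the copy can start.
std::vector<std::int32_t> null_aware_offsets(std::vector<bool> const& valid, std::int32_t width)
{
  std::vector<std::int32_t> offsets(valid.size() + 1, 0);
  for (std::size_t i = 0; i < valid.size(); ++i) {
    offsets[i + 1] = offsets[i] + (valid[i] ? width : 0);
  }
  return offsets;
}
```

For three 4-byte rows where the middle row is null, the first helper yields offsets 0, 4, 8, 12 and the second yields 0, 4, 4, 8.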
In general this is looking good to me. Just a couple of questions and suggestions.
cpp/src/reshape/byte_cast.cu
Outdated
}

template <typename T>
struct byte_list_conversion_fn<
Should this one be up with the other and not straddling the dispatcher?
Sorry, we still have to dispatch, since there are many types in cudf::is_numeric.
Looks good.
Can there be tests for an empty input column too?
And perhaps tests for any throw conditions as well? Or maybe I missed those.
Co-authored-by: David Wendt <45795991+davidwendt@users.noreply.github.com>
Thanks. After adding the suggested tests, I discovered bug(s) when the input is empty or all nulls. They are now fixed.
Since this also uses |
…/byte-cast-sanitized
…the column factory methods
194520f to cf486fe
LGTM
I love that we had tests explicitly noting that we had unsanitized data and working around it.
One non-blocking note about east vs west constexpr, otherwise good to ship from me.
cpp/src/reshape/byte_cast.cu
Outdated
rmm::mr::device_memory_resource*) const

// Data type of the output data column after conversion.
data_type constexpr output_type{data_type{type_id::UINT8}};
We don't really use "east constexpr" anywhere in our codebase. Many of the reasons that east const makes sense don't apply to constexpr, particularly when it comes to the confusion around pointers. For instance, data_type constexpr* output_type does not in fact mean "pointer to constant data_type"; it means "constant pointer to data_type". Check out this example: https://godbolt.org/z/o4YaEhrsE.
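The pointer point can be checked with a small self-contained sketch. This is not the content of the godbolt link, and data_type_like is a hypothetical stand-in for cudf's data_type. The static_asserts show that constexpr applies to the declared pointer itself, while the pointee stays mutable.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical stand-in for cudf::data_type, used only for illustration.
struct data_type_like {
  int id;
};

// A non-const object with static storage duration, so its address is a
// valid constant expression.
inline data_type_like value{8};

// "East constexpr" on a pointer declaration: constexpr qualifies the
// declared object (the pointer), not the type it points to.
data_type_like constexpr* east_ptr = &value;

// East const on the pointee, for comparison: pointer to constant data.
data_type_like const* ptr_to_const = &value;

static_assert(std::is_same_v<decltype(east_ptr), data_type_like* const>,
              "constexpr makes the pointer itself const");
static_assert(std::is_same_v<decltype(ptr_to_const), data_type_like const*>,
              "const before * makes the pointee const");

// Writing through east_ptr compiles because the pointee is not const.
bool write_through_east_ptr()
{
  east_ptr->id = 42;
  return value.id == 42;
}
```

Writing through ptr_to_const in the same way would fail to compile, which is the distinction the comment above draws.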
auto chars_contents = col_content.children[strings_column_view::chars_column_index]->release();
auto const num_chars = chars_contents.data->size();
auto uint8_col = std::make_unique<column>(
  output_type, num_chars, std::move(*(chars_contents.data)), rmm::device_buffer{}, 0);
Nice, this section is cleaner now without the old null mask clutter.
/merge
Description
Resolves #11754. The byte_cast function is creating unsanitized lists from null inputs, which is a bug. This logic copies nonzero bytes even if the input element is null. The input's null mask is copied onto the output parent list column, but the null children are nonempty. This PR fixes the bug by calling cudf::purge_nonempty_nulls on the result before returning, if there are any nulls to be purged.
Depends on:
Checklist
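As an illustration of the fix described under Description, the following is a simplified host-side model in plain C++. It does not use or reproduce cudf::purge_nonempty_nulls; the byte_lists struct and both function names are hypothetical. It models a list-of-bytes column as offsets, child bytes, and a validity vector, detects null rows whose lists are nonempty, and rebuilds the column so that every null row is an empty list.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side model of a LIST<UINT8> column.
struct byte_lists {
  std::vector<std::int32_t> offsets;  // num_rows + 1 entries
  std::vector<std::uint8_t> child;    // concatenated bytes of all rows
  std::vector<bool> valid;            // one flag per row, false means null
};

// True if any null row still owns bytes in the child buffer.
bool has_nonempty_nulls_sketch(byte_lists const& in)
{
  for (std::size_t i = 0; i < in.valid.size(); ++i) {
    if (!in.valid[i] && in.offsets[i + 1] != in.offsets[i]) { return true; }
  }
  return false;
}

// Rebuild the column so that null rows become empty lists. Valid rows keep
// their bytes; the validity vector is carried over unchanged.
byte_lists purge_nonempty_nulls_sketch(byte_lists const& in)
{
  byte_lists out;
  out.valid = in.valid;
  out.offsets.push_back(0);
  for (std::size_t i = 0; i < in.valid.size(); ++i) {
    if (in.valid[i]) {
      out.child.insert(out.child.end(),
                       in.child.begin() + in.offsets[i],
                       in.child.begin() + in.offsets[i + 1]);
    }
    out.offsets.push_back(static_cast<std::int32_t>(out.child.size()));
  }
  return out;
}
```

Checking has_nonempty_nulls_sketch first mirrors the "if there are any nulls to be purged" condition in the description: when no null row owns bytes, the rebuild pass can be skipped.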