Fix large strings handling in nvtext::character_tokenize #15829
Conversation
Looks good. Is this a breaking change to null handling behavior?
edit: good, it is marked as such. Thanks!
Looks great! Thanks
Python changes look good to me
/merge
Description
Fixes the logic in `nvtext::character_tokenize` to handle large strings input. The output turns each character into a row, so an input strings column larger than 2GB will likely overflow the `size_type` row limit. The `thrust::count_if` call is replaced with a raw kernel that produces a count which can be checked against the maximum row count.

Also changes the API to no longer accept null rows, since the code does not check for them and can return invalid results for inputs containing unsanitized null rows.
Checklist