Fix subword_tokenize error when input contains no tokens #13320

Merged
merged 10 commits into rapidsai:branch-23.06 from bug-subword-empty-tokens
May 15, 2023

Conversation

davidwendt
Contributor

Description

Fixes a bug where an exception is thrown when the entire input column contains no tokens, for example when every string in the column contains only whitespace.
In this special case, the function now returns token-ids of all zeros along with an attention mask of all zeros, each sized input.size() * max_sequence_length.
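
The special case described above can be sketched in plain host-side C++. This is only an illustration of the output shape, not the code from this PR: `sketch_result`, `has_token`, `column_has_no_tokens`, and `make_empty_result` are hypothetical names, whitespace detection via `std::isspace` is an assumption, and the real implementation operates on device columns.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical result type for illustration only; not a libcudf type.
struct sketch_result {
  std::vector<std::uint32_t> token_ids;
  std::vector<std::uint32_t> attention_mask;
};

// True when the string holds at least one non-whitespace character.
// Assumption: a "token" needs at least one such character.
inline bool has_token(std::string const& s)
{
  return std::any_of(s.begin(), s.end(), [](unsigned char c) { return !std::isspace(c); });
}

// True when no row of the input would produce a token
// (also true for an input with zero rows).
inline bool column_has_no_tokens(std::vector<std::string> const& input)
{
  return std::none_of(input.begin(), input.end(), has_token);
}

// Builds the all-zero fallback: token ids and attention mask each hold
// rows * max_sequence_length zeros.
inline sketch_result make_empty_result(std::size_t rows, std::size_t max_sequence_length)
{
  auto const count = rows * max_sequence_length;
  return sketch_result{std::vector<std::uint32_t>(count, 0u),
                       std::vector<std::uint32_t>(count, 0u)};
}
```

A caller would check `column_has_no_tokens(input)` first and return `make_empty_result(input.size(), max_sequence_length)` instead of running the tokenizer, so downstream code always receives buffers of the expected size.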

Closes #13300

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@davidwendt davidwendt added the labels bug (Something isn't working), 2 - In Progress (Currently a work in progress), libcudf (Affects libcudf (C++/CUDA) code), and non-breaking (Non-breaking change) on May 9, 2023
@davidwendt davidwendt self-assigned this May 9, 2023
Contributor

@cwharris cwharris left a comment

Looks good, but seems like there's some leftover code? nevermind. Looks good. ctrl-f wasn't working for some reason.

Review thread on cpp/src/text/subword/subword_tokenize.cu (outdated, resolved)
Contributor

@cwharris cwharris left a comment

I need to verify this change works with Morpheus (we have a workaround that does something similar, but maybe not the same).

Contributor

@cwharris cwharris left a comment

The empty result is identical to the one we use as a workaround in Morpheus.

@davidwendt davidwendt marked this pull request as ready for review May 10, 2023 17:01
@davidwendt davidwendt requested a review from a team as a code owner May 10, 2023 17:01
@davidwendt
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit 403c83f into rapidsai:branch-23.06 May 15, 2023
@davidwendt davidwendt deleted the bug-subword-empty-tokens branch May 15, 2023 17:00
Development

Successfully merging this pull request may close these issues.

[BUG] nvtext::subword_tokenize fails when given only whitespace
3 participants