Fix large strings handling in nvtext::character_tokenize #15829
Conversation
Looks good. Is this a breaking change to null handling behavior?
edit: good, it is marked as such. Thanks!
Looks great! Thanks
Python changes look good to me
/merge
Description
Fixes the logic in `nvtext::character_tokenize` to handle large strings input. The output turns each character into a row, so an input strings column larger than 2GB will likely overflow the `size_type` row limit. The `thrust::count_if` call is replaced with a raw kernel that produces a count which can be checked against the maximum row count.

Also changes the API to no longer accept null rows, since the code does not check for them and can return invalid results for inputs containing unsanitized null rows.
Checklist