support tokenize while keeping common censor chars #767

haileyok · 2024-09-29T09:28:12Z

In some cases it's useful to tokenize a string without splitting on certain non-letter chars like #, *, -, or _. Unfortunately, right now Tokenize splits on all of these, which makes some matching difficult.

This adds a second tokenizer, TokenizeTextSkippingCensorChars, for those particular use cases. It also adds TokenizeTextWithRegex, so that other cases can be covered easily in the future if they arise.
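For readers who want a concrete picture of the behavior described above, here is a minimal sketch in Go. It is not the code from this PR: the function name `tokenizeWithRegex`, both regex patterns, and the lowercasing step are assumptions made for illustration. The idea is that a single regex-driven helper replaces every "non-token" character with a space, and the censor-char variant simply excludes #, *, - and _ from the set of characters that get replaced.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Hypothetical patterns (not taken from the PR).
// nonTokenChars matches runs of anything that is not a letter, number, or whitespace.
var nonTokenChars = regexp.MustCompile(`[^\pL\pN\s]+`)

// nonTokenCharsKeepCensor is the same, except #, *, _ and - are NOT matched,
// so they survive inside tokens.
var nonTokenCharsKeepCensor = regexp.MustCompile(`[^\pL\pN\s#*_-]+`)

// tokenizeWithRegex replaces every match of sep with a space, lowercases the
// result, and splits it on whitespace.
func tokenizeWithRegex(text string, sep *regexp.Regexp) []string {
	cleaned := sep.ReplaceAllString(text, " ")
	return strings.Fields(strings.ToLower(cleaned))
}

func main() {
	input := "Some b*d w_rd, #tag!"
	fmt.Println(tokenizeWithRegex(input, nonTokenChars))           // [some b d w rd tag]
	fmt.Println(tokenizeWithRegex(input, nonTokenCharsKeepCensor)) // [some b*d w_rd #tag]
}
```

With the default pattern, "b*d" is split into two tokens and can no longer be matched as a single keyword; with the censor-char pattern it stays intact. Passing the regex as a parameter is what lets other variants be added later without another dedicated function.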

haileyok added 3 commits September 29, 2024 02:24

add tokenize while keeping common censor chars

0a82c6c

rename

4da4e79

clean

c8238e1
