fix(text): unicode support and word splitting according to case #5447

Merged
merged 14 commits into from
Jul 22, 2024

Conversation

guy-borderless
Contributor

The current implementation of splitToWords() supports only Latin letters. This PR adds a Unicode-compatible implementation and support for splitting words according to case.
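To illustrate the limitation described above, here is a minimal sketch contrasting a Latin-only word pattern with one based on Unicode property escapes. The function names `splitLatinOnly` and `splitUnicode` are hypothetical and are not taken from the PR; the actual patterns in text/_util.ts may differ.

```typescript
// Hypothetical helpers for illustration; not the PR's actual code.

// Matches runs of ASCII letters and digits only.
function splitLatinOnly(input: string): string[] {
  return input.match(/[A-Za-z0-9]+/g) ?? [];
}

// Matches runs of any Unicode letter (\p{L}) or number (\p{N}).
// Built with the RegExp constructor so the "u" flag is resolved at runtime.
function splitUnicode(input: string): string[] {
  return input.match(new RegExp("[\\p{L}\\p{N}]+", "gu")) ?? [];
}

console.log(splitLatinOnly("привет мир")); // no ASCII letters, so no words
console.log(splitUnicode("привет мир")); // two words
```

With the Latin-only pattern, text in other scripts yields no words at all, which is the kind of gap the PR description refers to.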

Contributor

@iuioiua iuioiua left a comment


Nice. Just a few nits on the first pass. Thank you.

@kt3k kt3k changed the title feat(text): Unicode support and WordSpliting according to case fix(text): Unicode support and WordSpliting according to case Jul 18, 2024

codecov bot commented Jul 18, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 96.29%. Comparing base (a8a637f) to head (90c7286).
Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5447      +/-   ##
==========================================
- Coverage   96.30%   96.29%   -0.01%     
==========================================
  Files         465      465              
  Lines       37705    37711       +6     
  Branches     5561     5560       -1     
==========================================
+ Hits        36312    36315       +3     
- Misses       1351     1354       +3     
  Partials       42       42              


@kt3k
Member

kt3k commented Jul 22, 2024

By this change, the result of toSnakeCase("Camel1Case") becomes camel_1_case (It is camel1_case in main), while the result of toSnakeCase("camel1Case") is kept as camel1_case. Is this change reasonable/acceptable?

@guy-borderless
Contributor Author

guy-borderless commented Jul 22, 2024

> By this change, the result of toSnakeCase("Camel1Case") becomes camel_1_case (It is camel1_case in main), while the result of toSnakeCase("camel1Case") is kept as camel1_case. Is this change reasonable/acceptable?

Looking at a more semantic example like iAte2Cookies the change does seem reasonable to me.

@kt3k
Member

kt3k commented Jul 22, 2024

> Looking at a more semantic example like iAte2Cookies the change does seem reasonable to me.

I agree with this.

In that case, we should probably return has_2_cookies for toSnakeCase("has2Cookies") for consistency (it currently returns has2_cookies). I'll work on this.
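The two digit-handling policies discussed in this thread can be sketched as follows. This is an ASCII-only illustration under assumed patterns; `DIGITS_ATTACHED`, `DIGITS_SEPARATE`, and `toSnakeCaseSketch` are hypothetical names and do not reproduce the library's implementation.

```typescript
// Hypothetical ASCII-only tokenizers, for illustration only.

// Policy A: trailing digits stay attached to the preceding word.
const DIGITS_ATTACHED = /[A-Z]?[a-z]+[0-9]*|[A-Z]+|[0-9]+/g;

// Policy B: a run of digits is always its own word.
const DIGITS_SEPARATE = /[A-Z]?[a-z]+|[A-Z]+|[0-9]+/g;

// Splits the input with the given word pattern and joins the
// lowercased words with underscores.
function toSnakeCaseSketch(input: string, wordPattern: RegExp): string {
  const words: string[] = input.match(wordPattern) ?? [];
  return words.map((word) => word.toLowerCase()).join("_");
}

console.log(toSnakeCaseSketch("has2Cookies", DIGITS_ATTACHED)); // has2_cookies
console.log(toSnakeCaseSketch("has2Cookies", DIGITS_SEPARATE)); // has_2_cookies
console.log(toSnakeCaseSketch("Camel1Case", DIGITS_SEPARATE)); // camel_1_case
```

Under policy B the result no longer depends on the case of the first letter, which is the consistency argument made above.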

@guy-borderless
Contributor Author

guy-borderless commented Jul 22, 2024

I made some changes to better handle acronyms followed by a capitalized word, like in URLPattern (before it would have been ["URLP", "aattern"], now it returns ["URL", "Pattern"]).
I also went ahead and simplified the implementation to use only one input.match() call.

I think an added benefit of doing everything with one input.match() call is that, if we later want to export splitToWords, it will be simple to let the user extend the word regexp (including with non-alphanumeric patterns such as a URL or an email address). It also now returns an iterator.
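A minimal sketch of the single-pattern idea, assuming a lookahead-based alternative for acronyms. The names `WORD_SOURCE` and `splitToWordsSketch`, the `extraPatterns` parameter, and the email pattern are all hypothetical; the merged regexp in text/_util.ts may be structured differently, and this sketch returns an array rather than an iterator.

```typescript
// Hypothetical word pattern, for illustration only. Alternatives are
// tried in order at each position.
const WORD_SOURCE = [
  "\\p{Lu}+(?=\\p{Lu}\\p{Ll})", // acronym followed by a capitalized word
  "\\p{Lu}?\\p{Ll}+", // capitalized or lowercase word
  "\\p{Lu}+", // remaining run of uppercase letters
  "\\p{N}+", // run of digits
  "\\p{L}+", // letters without case, e.g. CJK
].join("|");

// Splits the input with a single match() call. Extra pattern sources
// (an assumed extension point) take precedence over the default ones.
function splitToWordsSketch(
  input: string,
  extraPatterns: string[] = [],
): string[] {
  const source = extraPatterns.concat(WORD_SOURCE).join("|");
  return input.match(new RegExp(source, "gu")) ?? [];
}

console.log(splitToWordsSketch("URLPattern")); // ["URL", "Pattern"]
console.log(
  splitToWordsSketch("mail bob@example.com now", ["[\\w.]+@[\\w.]+"]),
); // ["mail", "bob@example.com", "now"]
```

The lookahead lets the uppercase run stop one letter early, so the final capital begins the next word instead of being absorbed into the acronym.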

@iuioiua iuioiua changed the title fix(text): Unicode support and WordSpliting according to case fix(text): unicode support and word splitting according to case Jul 22, 2024
Member

@kt3k kt3k left a comment


LGTM

@kt3k
Member

kt3k commented Jul 22, 2024

> I made some changes to better handle Acronyms followed by a capitalized word, like in URLPattern (before it would have been ["URLP", "aattern"], now return ["URL", "Pattern"]).

This looks like a great improvement to me. Nice!

@guy-borderless guy-borderless force-pushed the unicode-and-case-wordSplit branch from dfd42c7 to 90c7286 Compare July 22, 2024 11:37
@kt3k kt3k merged commit 97c5596 into denoland:main Jul 22, 2024
16 checks passed
4 participants