add Unicode.isequal_normalized function #42493
Conversation
It's not really clear to me what
Nitpick: indentation should be 4 spaces, I guess.
Indentation should be fixed now. (For some reason, vscode was detecting Unicode.jl as using 2-space indentation and adjusted my code accordingly.)
Maybe …? Alternatively, we could just call it `isequal_normalized`.
That sounds good to me. I don't want to hold this up any further though, if others don't mind the name.
Renamed to `isequal_normalized`.
This adds a function `isequal_normalized` to the Unicode stdlib to check whether two strings are canonically equivalent (optionally casefolding and/or stripping combining marks). Previously, the only way to do this was to call `Unicode.normalize` on the two strings, to construct normalized versions, but this seemed a bit wasteful — the new `isequal_normalized` function calls lower-level functions in utf8proc to accomplish the same task while only allocating 4-codepoint (16-byte) temporary arrays. It seems to be about 2x faster than calling `normalize` in the expensive case where the strings are equivalent, and is potentially much faster for inequivalent strings for which the loop can break early. (If we could stack-allocate small arrays it might get faster.) (In the future, we might also want to add `Unicode.isless_normalized` and `Unicode.cmp_normalized` functions for comparing Unicode strings, but `isequal_normalized` seemed like a good start.)
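For context, a small usage sketch of the new function (assuming the `casefold` and `stripmark` keywords, which mirror the corresponding `Unicode.normalize` options; the outputs follow from canonical equivalence):

```julia
using Unicode

# "é" as a single codepoint (U+00E9) vs. "e" followed by a combining acute accent:
a = "\u00e9"
b = "e\u0301"

a == b                                                    # false: different codepoint sequences
Unicode.normalize(a, :NFC) == Unicode.normalize(b, :NFC)  # true, but allocates two new strings
Unicode.isequal_normalized(a, b)                          # true, without building normalized copies

Unicode.isequal_normalized(a, "E\u0301"; casefold=true)   # true: casefolded comparison
Unicode.isequal_normalized(a, "e"; stripmark=true)        # true: combining marks are ignored
```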
Fixes #52408. (Note that this function was added in Julia 1.8, in #42493.) In the future it would be good to further optimize this function by adding a fast path for the common case of strings that are mostly ASCII characters. Perhaps simply skip ahead to the first byte that doesn't match before we begin doing decomposition etcetera. (cherry picked from commit 3b250c7)
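That fast path is not implemented here; a rough sketch of the idea (hypothetical wrapper name, relying on the fact that ASCII characters have combining class 0 and normalize to themselves, so an identical ASCII prefix cannot compose or reorder with what follows it) could look like:

```julia
using Unicode

# Hypothetical wrapper (not part of this PR): skip the longest common prefix of
# identical ASCII bytes before doing the full normalized comparison.
function isequal_normalized_ascii_fastpath(s1::String, s2::String; kwargs...)
    n = min(ncodeunits(s1), ncodeunits(s2))
    i = 0
    while i < n
        b = codeunit(s1, i + 1)
        (b >= 0x80 || b != codeunit(s2, i + 1)) && break  # stop at non-ASCII or mismatch
        i += 1
    end
    # Compare only the (possibly empty) suffixes after the matching ASCII prefix.
    # Note: assumes the default options; a custom `chartransform` could invalidate the skip.
    return Unicode.isequal_normalized(SubString(s1, i + 1), SubString(s2, i + 1); kwargs...)
end
```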