add Unicode.isequal_normalized function #42493
Conversation
It's not really clear to me what
Nitpick: indentation should be 4 spaces, I guess.
Indentation should be fixed now. (For some reason, vscode was detecting Unicode.jl as using 2-space indentation and adjusted my code accordingly.)
Maybe …? Alternatively, we could just call it `isequal_normalized`.
That sounds good to me. I don't want to hold this up any further though, if others don't mind the name.
Renamed to `isequal_normalized`.
This adds a function `isequal_normalized` to the Unicode stdlib to check whether two strings are canonically equivalent (optionally casefolding and/or stripping combining marks). Previously, the only way to do this was to call `Unicode.normalize` on the two strings, to construct normalized versions, but this seemed a bit wasteful — the new `isequal_normalized` function calls lower-level functions in utf8proc to accomplish the same task while only allocating 4-codepoint (16-byte) temporary arrays. It seems to be about 2x faster than calling `normalize` in the expensive case where the strings are equivalent, and is potentially much faster for inequivalent strings for which the loop can break early. (If we could stack-allocate small arrays it might get faster.) (In the future, we might also want to add `Unicode.isless_normalized` and `Unicode.cmp_normalized` functions for comparing Unicode strings, but `isequal_normalized` seemed like a good start.)
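For context, a small usage sketch of the new function (assuming the `casefold` and `stripmark` keywords, which mirror the corresponding `Unicode.normalize` options; the outputs follow from canonical equivalence):

```julia
using Unicode

# "é" as a single codepoint (U+00E9) vs. "e" followed by a combining acute accent:
a = "\u00e9"
b = "e\u0301"

a == b                                                    # false: different codepoint sequences
Unicode.normalize(a, :NFC) == Unicode.normalize(b, :NFC)  # true, but allocates two new strings
Unicode.isequal_normalized(a, b)                          # true, without building normalized copies

Unicode.isequal_normalized(a, "E\u0301"; casefold=true)   # true: casefolded comparison
Unicode.isequal_normalized(a, "e"; stripmark=true)        # true: combining marks are ignored
```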
Fixes #52408. (Note that this function was added in Julia 1.8, in #42493.) In the future it would be good to further optimize this function by adding a fast path for the common case of strings that are mostly ASCII characters. Perhaps simply skip ahead to the first byte that doesn't match before we begin doing decomposition etcetera. (cherry picked from commit 3b250c7)
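That fast path is not implemented here; a rough sketch of the idea (hypothetical wrapper name, relying on the fact that ASCII characters have combining class 0 and normalize to themselves, so an identical ASCII prefix cannot compose or reorder with what follows it) could look like:

```julia
using Unicode

# Hypothetical wrapper (not part of this PR): skip the longest common prefix of
# identical ASCII bytes before doing the full normalized comparison.
function isequal_normalized_ascii_fastpath(s1::String, s2::String; kwargs...)
    n = min(ncodeunits(s1), ncodeunits(s2))
    i = 0
    while i < n
        b = codeunit(s1, i + 1)
        (b >= 0x80 || b != codeunit(s2, i + 1)) && break  # stop at non-ASCII or mismatch
        i += 1
    end
    # Compare only the (possibly empty) suffixes after the matching ASCII prefix.
    # Note: assumes the default options; a custom `chartransform` could invalidate the skip.
    return Unicode.isequal_normalized(SubString(s1, i + 1), SubString(s2, i + 1); kwargs...)
end
```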