ICU comparison routines should use case folding, not case mapping #27540
+1 |
@tarekgh @stephentoub Do you have any thoughts on the open question raised at #27540 (comment)? I'm trying to gauge whether it would be best to remove the special-case and rely solely on ICU's built-in mapping tables. |
In general, we are moving towards ICU even on Windows. I prefer removing this special case, since that fits our future direction. We already have discrepancies between Windows and ICU anyway, so I don't think this case will be a big deal. |
It turns out the special case for the Turkish I isn't a problem after all since with this change we're using simple case folding instead of upper case mapping. I folds to i So this means that under an This is the desired behavior anyway, so we can remove the special case. |
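To make the difference concrete, here is a minimal sketch in C contrasting uppercase mapping with simple case folding for a few code points. The tables are hand-copied entries from UnicodeData.txt and CaseFolding.txt, not ICU's real tables, and every name (`Mapping`, `sketch_to_upper`, `sketch_fold`, `equal_by_upper`, `equal_by_fold`) is hypothetical and does not come from pal_collation.c.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical illustration only: a handful of hand-copied entries. */
typedef struct { uint32_t from; uint32_t to; } Mapping;

/* Simple uppercase mappings (UnicodeData.txt) for two non-ASCII code points. */
static const Mapping kUpper[] = {
    { 0x0131u, 0x0049u }, /* LATIN SMALL LETTER DOTLESS I -> I */
    { 0x017Fu, 0x0053u }, /* LATIN SMALL LETTER LONG S    -> S */
};

/* Simple case foldings (CaseFolding.txt, status C). U+0131 has no entry. */
static const Mapping kFold[] = {
    { 0x017Fu, 0x0073u }, /* LATIN SMALL LETTER LONG S -> s */
    { 0x212Au, 0x006Bu }, /* KELVIN SIGN               -> k */
};

/* Return the mapped value, or the code point itself when no entry exists. */
static uint32_t lookup(const Mapping *table, size_t count, uint32_t cp)
{
    for (size_t i = 0; i < count; i++) {
        if (table[i].from == cp) {
            return table[i].to;
        }
    }
    return cp;
}

uint32_t sketch_to_upper(uint32_t cp)
{
    if (cp >= 0x61u && cp <= 0x7Au) { /* ASCII a-z */
        return cp - 0x20u;
    }
    return lookup(kUpper, sizeof kUpper / sizeof kUpper[0], cp);
}

uint32_t sketch_fold(uint32_t cp)
{
    if (cp >= 0x41u && cp <= 0x5Au) { /* ASCII A-Z */
        return cp + 0x20u;
    }
    return lookup(kFold, sizeof kFold / sizeof kFold[0], cp);
}

/* Comparison via uppercase mapping (the pre-PR approach, as described). */
bool equal_by_upper(uint32_t a, uint32_t b)
{
    return sketch_to_upper(a) == sketch_to_upper(b);
}

/* Comparison via simple case folding (the approach this PR proposes). */
bool equal_by_fold(uint32_t a, uint32_t b)
{
    return sketch_fold(a) == sketch_fold(b);
}
```

With these tables, `equal_by_upper(0x0131, 'I')` holds because U+0131 uppercases to U+0049, while `equal_by_fold(0x0131, 'I')` does not, since U+0131 folds to itself. That is why no U+0131 / U+0049 special case is needed under folding.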
I've marked this NO MERGE for now because it also impacts APIs like |
There is also potential security impact from this change. For example, the following mappings are valid under a case folding mechanism:
(The above data is from https://www.unicode.org/Public/UCD/latest/ucd/CaseFolding.txt.)
This means, for instance, that after this change the expression
One option is to re-introduce a special case which says "if one of the characters to check is ASCII and the other character is non-ASCII, they're never equal, regardless of what ICU case folding says." But if we go down this path we're going to have our own semantics start to creep on top of ICU's own semantics, and I don't know just where that road leads.
Edit: Jeremy kindly confirmed offline for me that the expression |
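As a rough illustration of the "ASCII never equals non-ASCII" option floated above, the following C sketch layers such a guard on top of a stand-in fold function. The fold table is a tiny hand-written subset of CaseFolding.txt (it is not ICU), and all names here (`demo_fold`, `equal_fold_only`, `equal_fold_with_ascii_guard`) are hypothetical; this is not the logic in the PR.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for a real simple-case-folding routine. Covers ASCII plus
 * three CaseFolding.txt entries; everything else folds to itself. */
static uint32_t demo_fold(uint32_t cp)
{
    if (cp >= 0x41u && cp <= 0x5Au) { /* ASCII A-Z */
        return cp + 0x20u;
    }
    switch (cp) {
        case 0x00C9u: return 0x00E9u; /* E WITH ACUTE (capital) -> small */
        case 0x017Fu: return 0x0073u; /* LONG S      -> s */
        case 0x212Au: return 0x006Bu; /* KELVIN SIGN -> k */
        default:      return cp;
    }
}

static bool is_ascii(uint32_t cp)
{
    return cp < 0x80u;
}

/* Plain folding comparison: non-ASCII code points may match ASCII ones. */
bool equal_fold_only(uint32_t a, uint32_t b)
{
    return demo_fold(a) == demo_fold(b);
}

/* Folding comparison with the proposed guard: if exactly one side is
 * ASCII, report "not equal" regardless of what folding says. */
bool equal_fold_with_ascii_guard(uint32_t a, uint32_t b)
{
    if (is_ascii(a) != is_ascii(b)) {
        return false;
    }
    return demo_fold(a) == demo_fold(b);
}
```

Under this sketch, U+017F (ſ) matches `s` in `equal_fold_only` but not in `equal_fold_with_ascii_guard`, while pairs that stay on one side of the ASCII boundary (`S`/`s`, `É`/`é`) still match. The cost is the one raised in the comment: the guard is a semantic layer of our own on top of the folding data.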
It feels strange to me that something would map/fold/whatever across any two of { ASCII, non-ASCII BMP, !BMP }; because I'd personally feel like code of the form

    private static readonly HashSet<string> s_knownIds = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
    {
        "fantasy",
    };
    ...
    if (s_knownIds.Contains(input))
    {
        Process(input);
    }

has effectively asserted that |
503c720 to 16fde36
Based on the earlier feedback here and at #32247 I've updated the logic as follows. These changes apply only when running on non-Windows platforms.
This implies that Another notable consequence of this change is how strings are sorted under a case-insensitive comparer. Previously, since ASCII strings were normalized to uppercase on all platforms, the strings With this PR:
This behavior is platform-specific and is not affected by the value of the |
There's still an open question as to whether we should block non-ASCII -> ASCII conversion during a

    string.Equals("administrator", "adminiſtrator", StringComparison.OrdinalIgnoreCase) // FALSE
    "administrator".ToUpperInvariant() == "adminiſtrator".ToUpperInvariant() // TRUE

If needed we could block this sort of conversion entirely without carrying the ICU / NLS delta. See also: |
I wouldn't recommend blocking that for invariant. Invariant operations are still cultural operations, and it doesn't make sense to block this behavior there. |
I'm no longer actively working on this, so closing the PR. Moved everything to GrabYourPitchforks#8 so that I don't lose track of it. |
Fixes #26961.
Adjusts OrdinalIgnoreCase string comparison routines (on ICU only) to use case folding instead of case mapping.
Open question: What should we do about the special-casing logic for U+0131 and U+0049? This PR aside, there are tons of differences in the case mapping tables between ICU and NLS, and the pair that is checked for here is only one such difference. I don't believe it's feasible for us to carry the delta between ICU and NLS within our runtime, which means we should probably scrap the below check and say "we're following the same behavior ICU has."
runtime/src/libraries/Native/Unix/System.Globalization.Native/pal_collation.c
Lines 537 to 543 in 7aff91c