Strange Contains and IndexOf handling of "\0" in .NET 5.0 #46569
Comments
Tagging subscribers to this area: @tarekgh, @safern, @krwq
|
This is by design on ICU, as "\0" is a weightless character in ICU. It was discussed in this issue: #4673 (comment). This has been the behavior in .NET Core on Unix systems since .NET Core 2.0, and as of .NET 5.0 we decided to use ICU by default on Windows as well, to bring behavior on par across all OSs. You can look at the doc https://docs.microsoft.com/en-us/dotnet/standard/globalization-localization/globalization-icu to learn more about the change to ICU. The doc explains how you can switch back to NLS behavior if you need to do so (however, that is not recommended, as in the long term NLS will be legacy). Also, there is #43956 to make this change less painful for .NET 6.0, which is our LTS. This is also a long thread that might help you understand some of the implications of, and motivation for, the breaking change: #43736 (comment). I'm going to close this issue; please let us know if you have more questions, and thank you for opening the issue. |
Just to add to what @safern mentioned: Unicode collation has some characters which are ignored during cultural collation operations. Think of it as if these characters do not exist in the string at all. The null character \0 is one of them. For searching for such control characters, we always recommend using an ordinal operation. @xanatos feel free to ask any questions if you think anything here is unclear, and thanks for reporting the issue. |
No, it is quite clear. The chart is a little misleading in the glyphs shown for the 0080-009F block, because it shows glyphs that in truth have been remapped. So by looking at the chart it seems that 0080 is the Euro symbol, but in truth the Euro symbol is 20AC and 0080 is a control character (the first thing I thought while looking at the chart was: why does the Unicode team think the European Euro is less important than the American Dollar 🤣). But what they did has a certain logic. Sadly, to make string comparisons you'll now need a master's degree in Unicode Technologies, but that is another problem. |
For a single-char control code, isn't it easier (and also faster) to use single quotes to get the char overload, which defaults to Ordinal? "test".IndexOf('\0') |
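As a minimal sketch of the ordinal options mentioned above (the class and helper names here are hypothetical, chosen only for illustration), the char overload and an explicit StringComparison.Ordinal argument both treat \0 as a literal character:

```csharp
using System;

public static class NullCharSearch
{
    // Hypothetical helper: searches for a literal value by ordinal comparison,
    // so no character is ignored as "weightless".
    public static int IndexOfOrdinal(string text, string value) =>
        text.IndexOf(value, StringComparison.Ordinal);

    public static void Main()
    {
        // The char overload of IndexOf is always ordinal.
        Console.WriteLine("test".IndexOf('\0'));                            // -1
        Console.WriteLine("te\0st".IndexOf('\0'));                          // 2

        // String overloads are ordinal only when asked explicitly.
        Console.WriteLine(IndexOfOrdinal("test", "\0"));                    // -1
        Console.WriteLine("test".Contains("\0", StringComparison.Ordinal)); // False
    }
}
```

The Contains(string, StringComparison) overload used here assumes .NET Core 2.1 or later.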
I agree the chart can be confusing if you look at it without previous knowledge, but at least the chart lists all Unicode code points which are ignored, which makes it easy to check the behavior of such characters. I believe the chart used the euro sign at 0x80 because, when the euro sign was first introduced, it had to be supported in most code pages (not only Unicode). For most code pages, the character 0x80 is the euro sign. Here is an example: https://en.wikipedia.org/wiki/Windows-1252. But I still agree the chart is confusing.
Linguistic operations can be very surprising for many languages, especially if you are not familiar with those languages. That is why you need to be conscious when doing such operations. You don't have to be an expert in this at all, but you need to evaluate the scenario in which you are using the operation. For example, if you are displaying a sorted list of strings in your app UI, it would make sense to use the linguistic operation regardless of your knowledge of the details, because the list will then be sorted according to the user's expectation. Search operations in which you are looking for specific literal characters should be ordinal operations. Feel free to send any more questions if you have any. |
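The split described above could be sketched roughly as follows. The class and method names are hypothetical; the point is only that display sorting goes through a culture-aware comparer while a literal search passes StringComparison.Ordinal:

```csharp
using System;

public static class SortVersusSearch
{
    // Hypothetical helper: linguistic sort for lists shown to the user.
    // The resulting order depends on the current culture.
    public static string[] SortForDisplay(string[] items)
    {
        var copy = (string[])items.Clone();
        Array.Sort(copy, StringComparer.CurrentCulture);
        return copy;
    }

    // Hypothetical helper: ordinal search for specific literal characters.
    public static bool ContainsLiteral(string text, string literal) =>
        text.Contains(literal, StringComparison.Ordinal);

    public static void Main()
    {
        foreach (var name in SortForDisplay(new[] { "zebra", "Émile", "apple" }))
        {
            Console.WriteLine(name);
        }

        Console.WriteLine(ContainsLiteral("test", "\0"));   // False
        Console.WriteLine(ContainsLiteral("te\0st", "\0")); // True
    }
}
```

No expected order is shown for the sorted names, since that is exactly the part that varies with culture and collation data.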
The new ICU handling of strings seems to have a problem with "\0" in .NET 5.0.
I expect all of these to write false and -1, but the CurrentCulture, CurrentCultureIgnoreCase, InvariantCulture and InvariantCultureIgnoreCase comparisons return true and 0. This is a breaking change from 3.1 and quite illogical, considering that \0 is a "normal" character in .NET. I've noticed that if I use "\0test" the results are the same (true and 0).
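The report's own code sample is not preserved in this text. A reproduction along these lines might look like the following sketch, which is an assumption about its shape rather than the reporter's actual code (NullCharRepro and Probe are hypothetical names):

```csharp
using System;

public static class NullCharRepro
{
    // Hypothetical helper: reports what Contains and IndexOf return
    // for one comparison mode.
    public static (bool Contains, int Index) Probe(string text, string value, StringComparison mode) =>
        (text.Contains(value, mode), text.IndexOf(value, mode));

    public static void Main()
    {
        var modes = new[]
        {
            StringComparison.CurrentCulture,
            StringComparison.CurrentCultureIgnoreCase,
            StringComparison.InvariantCulture,
            StringComparison.InvariantCultureIgnoreCase,
            StringComparison.Ordinal,
            StringComparison.OrdinalIgnoreCase,
        };

        foreach (var mode in modes)
        {
            var (contains, index) = Probe("test", "\0", mode);
            Console.WriteLine($"{mode}: Contains={contains}, IndexOf={index}");
        }
    }
}
```

According to the report, the four culture-sensitive modes give true and 0 on .NET 5.0 with ICU, while false and -1 were expected. The two ordinal modes give false and -1 regardless of the globalization backend.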