Strange Contains and IndexOf handling of "\0" in .NET 5.0 #46569

xanatos · 2021-01-05T09:24:18Z

The new ICU handling of strings seems to have a problem with "\0" in .NET 5.0

Console.WriteLine($"Ordinal Contains null char {"test".Contains("\0", StringComparison.Ordinal)}");
Console.WriteLine($"OrdinalIgnoreCase Contains null char {"test".Contains("\0", StringComparison.OrdinalIgnoreCase)}");

Console.WriteLine($"CurrentCulture Contains null char {"test".Contains("\0", StringComparison.CurrentCulture)}");
Console.WriteLine($"CurrentCultureIgnoreCase Contains null char {"test".Contains("\0", StringComparison.CurrentCultureIgnoreCase)}");

Console.WriteLine($"InvariantCulture Contains null char {"test".Contains("\0", StringComparison.InvariantCulture)}");
Console.WriteLine($"InvariantCultureIgnoreCase Contains null char {"test".Contains("\0", StringComparison.InvariantCultureIgnoreCase)}");

Console.WriteLine($"Ordinal IndexOf null char {"test".IndexOf("\0", StringComparison.Ordinal)}");
Console.WriteLine($"OrdinalIgnoreCase IndexOf null char {"test".IndexOf("\0", StringComparison.OrdinalIgnoreCase)}");

Console.WriteLine($"CurrentCulture IndexOf null char {"test".IndexOf("\0", StringComparison.CurrentCulture)}");
Console.WriteLine($"CurrentCultureIgnoreCase IndexOf null char {"test".IndexOf("\0", StringComparison.CurrentCultureIgnoreCase)}");

Console.WriteLine($"InvariantCulture IndexOf null char {"test".IndexOf("\0", StringComparison.InvariantCulture)}");
Console.WriteLine($"InvariantCultureIgnoreCase IndexOf null char {"test".IndexOf("\0", StringComparison.InvariantCultureIgnoreCase)}");

I expect all of these to write false and -1, but the CurrentCulture, CurrentCultureIgnoreCase, InvariantCulture and InvariantCultureIgnoreCase variants return true and 0. This is a breaking change from 3.1 and quite illogical, considering that \0 is a "normal" character in .NET. I've noticed that if I use "\0test" the results are the same (true and 0).


ghost · 2021-01-05T09:24:22Z

Tagging subscribers to this area: @tarekgh, @safern, @krwq
See info in area-owners.md if you want to be subscribed.

Issue Details

Author:	xanatos
Assignees:	-
Labels:	`area-System.Globalization`, `untriaged`
Milestone:	-

benaadams · 2021-01-05T12:05:05Z

/cc @GrabYourPitchforks @tarekgh

safern · 2021-01-06T00:45:11Z

This is by design in ICU, as "\0" is a weightless character in ICU; this was discussed in this issue: #4673 (comment)

This has been the behavior in .NET Core on Unix systems since .NET Core 2.0, and as of .NET 5.0 we decided to use ICU by default on Windows as well, to bring behavior on par across all OSs.

You can look at the doc https://docs.microsoft.com/en-us/dotnet/standard/globalization-localization/globalization-icu to learn more about the move to ICU. The doc explains how you can switch back to NLS behavior if you need to do so (however, this is not recommended, because in the long term NLS will be legacy).
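As a rough illustration of the opt-out mentioned above, and assuming the `System.Globalization.UseNls` runtime setting described in the linked doc, a `runtimeconfig.json` fragment might look like the sketch below. The file layout shown is the usual runtime configuration shape; check the doc for the exact options that apply to your app.

```json
{
  "runtimeOptions": {
    "configProperties": {
      "System.Globalization.UseNls": true
    }
  }
}
```

This only affects Windows, where NLS is available, and it switches the whole process back to NLS rather than changing a single comparison.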

Also, see #43956, which aims to make this change less painful for .NET 6.0, which is our LTS.

There is also a long thread that might help you understand some of the implications and motivation for the breaking change: #43736 (comment)

I'm going to close this issue. Please let us know if you have more questions, and thank you for opening the issue.

tarekgh · 2021-01-06T01:18:26Z

Just to add to what @safern mentioned:

Unicode collation has some characters which are ignored during cultural collation operations. Think of it as if these characters do not exist in the string at all. The null character \0 is one of these characters. You can consult the Unicode standard for the full list of ignored characters here: https://www.unicode.org/charts/collation/chart_Ignored.html.

For searching for such control characters, we always recommend using ordinal operations.
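A minimal sketch of that recommendation, using only the standard `string.IndexOf` and `string.Contains` overloads that take a `StringComparison`. The class and helper names (`NullCharSearch`, `OrdinalIndexOf`, `OrdinalContains`) are made up for illustration.

```csharp
using System;

public static class NullCharSearch
{
    // Ordinal search compares raw UTF-16 code units, so "\0" is matched
    // like any other character and is never skipped as ignorable.
    public static int OrdinalIndexOf(string text, string value)
        => text.IndexOf(value, StringComparison.Ordinal);

    public static bool OrdinalContains(string text, string value)
        => text.Contains(value, StringComparison.Ordinal);

    public static void Main()
    {
        Console.WriteLine(OrdinalIndexOf("test", "\0"));    // -1
        Console.WriteLine(OrdinalContains("test", "\0"));   // False
        Console.WriteLine(OrdinalIndexOf("te\0st", "\0"));  // 2

        // Culture-sensitive search: the result depends on the globalization
        // backend. With ICU, "\0" is ignorable, and this thread reports 0 here.
        Console.WriteLine("test".IndexOf("\0", StringComparison.InvariantCulture));
    }
}
```

The last line is left without an expected value on purpose, since it differs between ICU and NLS.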

@xanatos feel free to send any questions if you think anything here is unclear, and thanks for reporting the issue.

xanatos · 2021-01-06T09:02:34Z

No, now it is quite clear. The chart is a little misleading in the glyphs shown for the 0080-009F block, because it shows glyphs that in truth have been remapped. So by looking at the chart it seems that 0080 is the Euro symbol, but in truth the Euro Symbol is 20AC and 0080 is a control character (the first thing I thought while looking at the chart was: why the Unicode team thinks the European Euro is less important than the American Dollar 🤣). But what they did has a certain logic. Sadly now to make string comparisons you'll need a master's degree in Unicode Technologies, but that is another problem.

benaadams · 2021-01-06T13:25:19Z

For a single-char control code, is it easier to use single quotes to pick a char overload, which defaults to Ordinal? (It's also faster.)

"test".IndexOf('\0')
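To spell out the suggestion a little: the `char` overloads of `IndexOf` and `Contains` do an ordinal search, so no `StringComparison` argument is needed. A small sketch (the class and helper names are illustrative only):

```csharp
using System;

public static class CharOverloadDemo
{
    // The char overloads compare code units directly (ordinal),
    // so '\0' is found only where it really occurs in the string.
    public static int IndexOfNull(string text) => text.IndexOf('\0');

    public static bool ContainsNull(string text) => text.Contains('\0');

    public static void Main()
    {
        Console.WriteLine(IndexOfNull("test"));     // -1
        Console.WriteLine(ContainsNull("test"));    // False
        Console.WriteLine(IndexOfNull("te\0st"));   // 2
    }
}
```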

tarekgh · 2021-01-06T17:52:44Z

> So by looking at the chart it seems that 0080 is the Euro symbol, but in truth the Euro Symbol is 20AC and 0080 is a control character (the first thing I thought while looking at the chart was: why the Unicode team thinks the European Euro is less important than the American Dollar 🤣). But what they did has a certain logic.

I agree the chart can be confusing if you look at it without previous knowledge, but at least the chart lists all the Unicode code points that are ignored, which makes it easy to check the behavior of such characters. I believe the chart uses the euro sign at 0x80 because, when the euro sign was initially introduced, it was required to be supported in most code pages (not only Unicode). For most code pages, the character 0x80 is the euro sign. Here is an example: https://en.wikipedia.org/wiki/Windows-1252. But I still agree the chart is confusing.

> Sadly now to make string comparisons you'll need a master's degree in Unicode Technologies, but that is another problem.

Linguistic operations can be very surprising for many languages, especially if you are not familiar with those languages. That is why you need to be conscious when doing such operations. You don't have to be an expert at all, but you need to evaluate the scenario in which you are using the operation. For example, if you are displaying a sorted list of strings in your app UI, it makes sense to use the linguistic operation regardless of your knowledge of the details, because the list will be sorted according to the user's expectations. Search operations in which you are looking for specific literal characters should be ordinal operations.
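As a sketch of that rule of thumb, the following uses `StringComparer.CurrentCulture` for a list meant for display and `StringComparison.Ordinal` for a literal search. The class and method names are hypothetical, and the sample strings are invented.

```csharp
using System;
using System.Collections.Generic;

public static class LinguisticVsOrdinal
{
    // Display scenario: sort the way the current user's culture expects.
    public static List<string> SortForDisplay(IEnumerable<string> items)
    {
        var sorted = new List<string>(items);
        sorted.Sort(StringComparer.CurrentCulture);
        return sorted;
    }

    // Literal search scenario: match exact code units, nothing is ignored.
    public static bool ContainsLiteral(string text, string value)
        => text.Contains(value, StringComparison.Ordinal);

    public static void Main()
    {
        var names = SortForDisplay(new[] { "zebra", "Apple", "apple", "Zoo" });
        Console.WriteLine(string.Join(", ", names)); // order depends on culture

        Console.WriteLine(ContainsLiteral("key=value\0padding", "\0")); // True
        Console.WriteLine(ContainsLiteral("test", "\0"));               // False
    }
}
```

The sorted order is deliberately not shown, because it depends on the current culture and the globalization backend.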

Feel free to send any more questions if you have any.

Dotnet-GitSync-Bot added the `area-System.Globalization` and `untriaged` (New issue has not been triaged by the area owner) labels Jan 5, 2021

xanatos changed the title ~~Contains and IndexOf handling of "\0" in .NET 5.0~~ Strange Contains and IndexOf handling of "\0" in .NET 5.0 Jan 5, 2021

safern closed this as completed Jan 6, 2021

huoyaoyuan mentioned this issue Jan 19, 2021: String.IndexOf("\0") returns always 0 in .NET 5.0 #47145 (Closed)

GrabYourPitchforks mentioned this issue Jan 19, 2021: Improving the developer experience with regard to default string globalization #43956 (Open)

ghost locked as resolved and limited conversation to collaborators Feb 5, 2021
