UTF8String Constants #909
Replies: 77 comments
-
Also: |
-
Or the compiler could transparently convert the string literal to UTF-8 when the target type is System.UTF8String:
System.UTF8String u8str = "blabla";
Then we don't need any new syntax.
-
@Unknown6656 isn't |
-
I would prefer not to join the chaos of C++ in terms of strings. I'd heavily favor something new (e.g. utf8string) over an already known keyword (like wstring) that might be misunderstood, misinterpreted, or create confusion. For confusion, see: https://stackoverflow.com/a/402918/5962841
-
@Unknown6656 |
-
In favor of |
-
Thought about this as well, but decided not to propose it, as I prefer clarity over terseness :)
-
@YaakovDavis Yeah me too but I'm not sure whether clarity is an issue here, maybe it is. :D p.s. If the type is |
-
I just proposed I like |
-
@tannergooding Yeah maybe, it's a minor thing, I wouldn't mind it either way but if I had to choose and |
-
Assuming you refer to the name length, the same can be said about |
-
I'll cite myself:
This is anything but clear. :)
-
@YaakovDavis, I had briefly considered |
-
@tannergooding |
-
@eyalsk, pushing the shift key slows down my code 😆
-
Then what if I save a single-char string, e.g.
This string is not 2-byte aligned.
-
This can be worked around by adding a padding byte in the case of odd-length strings. @davidwrighton and I sketched out a basic scheme here that we feel will work just fine.
-
Do Mono, CoreRT, and the other runtimes also validate and work correctly with this assumption? Also wondering, how will you differentiate a UTF-8 string from a UTF-16 string in the heap?
-
Yes, you can, but then how would you know whether it's "\n" with a zero padding byte or the string "\n\0" without padding?
-
You can use a prefix encoding scheme. At worst you end up reserving the first 1-2 bytes for tracking whether it's an odd-length string. For odd-length strings you pay one byte; for even-length strings you pay two bytes. @davidwrighton and I thought we could get it down smaller but didn't have time to dig into the details.
This doesn't give me much pause. If they don't, then we'd need to do the work to get them to support it.
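A minimal sketch of one possible prefix scheme, to make the byte accounting concrete. The header layout here (a first byte of 1 or 2 giving the header length) and the names Pack and Unpack are assumptions for illustration, not the scheme sketched by the posters.

```csharp
using System;
using System.Text;

// Hypothetical layout, chosen only for illustration:
//   odd-length payload  -> [0x01][payload...]        (1 header byte)
//   even-length payload -> [0x02][0x00][payload...]  (2 header bytes)
// Either way the total byte count is even, so it fills whole 16-bit units.
static char[] Pack(byte[] utf8)
{
    int header = (utf8.Length % 2 == 1) ? 1 : 2;
    byte[] buffer = new byte[header + utf8.Length];
    buffer[0] = (byte)header;
    Array.Copy(utf8, 0, buffer, header, utf8.Length);

    char[] units = new char[buffer.Length / 2];
    for (int i = 0; i < units.Length; i++)
    {
        // Little-endian packing: low byte first.
        units[i] = (char)(buffer[2 * i] | (buffer[2 * i + 1] << 8));
    }
    return units;
}

static byte[] Unpack(char[] units)
{
    byte[] buffer = new byte[units.Length * 2];
    for (int i = 0; i < units.Length; i++)
    {
        buffer[2 * i] = (byte)(units[i] & 0xFF);
        buffer[2 * i + 1] = (byte)(units[i] >> 8);
    }
    int header = buffer[0]; // the first byte says how many header bytes to skip
    byte[] utf8 = new byte[buffer.Length - header];
    Array.Copy(buffer, header, utf8, 0, utf8.Length);
    return utf8;
}

// "\n" (1 byte) and "\n\0" (2 bytes) stay distinguishable after packing.
char[] packedNewline = Pack(Encoding.UTF8.GetBytes("\n"));
char[] packedWithNul = Pack(Encoding.UTF8.GetBytes("\n\0"));
byte[] newlineBack = Unpack(packedNewline);
byte[] withNulBack = Unpack(packedWithNul);
Console.WriteLine($"{newlineBack.Length} byte vs {withNulBack.Length} bytes");
```

Because the header records its own length, the decoder never has to guess whether a trailing zero byte is padding or data, which is the ambiguity raised earlier in the thread.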
-
@jaredpar So it's basically not a string but rather some special struct with padding/oddness info/... Why store it in the string section, then? There are sections for raw bytes; why not just use them?
-
Nope. It's still very much a string. It just potentially changes the offset at which the string begins.
@davidwrighton has laid out a good case for this already. It has lots of advantages for the runtime and supporting languages.
-
@tannergooding CoreCLR, desktop CLR, Mono, and CoreRT will all accept strings that look like this. While they aren't actually valid UTF-16 strings, they are valid sequences of 16-bit UTF-16 code units (which is a very low bar: a UTF-16 code unit is any 16-bit number), and that's what actually matters to the runtimes. Additionally, I've checked the logic of tools such as ildasm, ilasm, and ilspy, and they all handle these weird cases as well.
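A small sketch of the "any 16-bit number" point, using only System.String at run time. It does not exercise metadata string literals or the tools named above; the variable names are illustrative.

```csharp
using System;

// 0xD800 is a high surrogate with no partner: not valid UTF-16 text,
// but still a storable 16-bit code unit.
string lone = new string(new[] { (char)0xD800 });
bool isLoneSurrogate = lone.Length == 1 && char.IsHighSurrogate(lone[0]);

// Arbitrary 16-bit values survive a trip through System.String unchanged.
char[] raw = { (char)0x0A01, (char)0xFFFF, (char)0x0000 };
string carrier = new string(raw);
bool roundTrips = carrier.Length == 3
    && carrier[0] == 0x0A01
    && carrier[1] == 0xFFFF
    && carrier[2] == 0x0000;

Console.WriteLine($"lone surrogate stored: {isLoneSurrogate}, round trip intact: {roundTrips}");
```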
-
The main argument I can think of against the
On the other hand, the C# compiler already supports embedding a data declaration and the necessary optimization to not allocate a compile-time constant byte array on the heap via dotnet/roslyn#24621 and expose it as a
The compromise I'm thinking of is to have the C# compiler support
ReadOnlySpan<byte> myUtf8String = "Foo";
because there is precedent and a cumbersome way to achieve it:
ReadOnlySpan<byte> myUtf8String = new byte[] { 0x46, 0x6F, 0x6F };
This would make the task of defining these data declarations less onerous for the perf-conscious developer who would like to target the existing runtime and still derive benefit from a newer C# compiler.
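To check the equivalence that the compromise relies on, the sketch below compares the hand-written bytes from the comment with what Encoding.UTF8 produces for "Foo". The variable names are illustrative, and it assumes a runtime with span overloads on Encoding (.NET Core 2.1 or later).

```csharp
using System;
using System.Text;

// The cumbersome form from the comment: spell out the UTF-8 bytes by hand.
ReadOnlySpan<byte> literal = new byte[] { 0x46, 0x6F, 0x6F };

// What a runtime conversion from the UTF-16 literal would yield.
byte[] encoded = Encoding.UTF8.GetBytes("Foo");

bool matches = literal.SequenceEqual(encoded);
string decoded = Encoding.UTF8.GetString(literal);

Console.WriteLine($"bytes match: {matches}, decoded: {decoded}");
```

Writing the bytes by hand gets harder to read and verify as soon as the text contains non-ASCII characters, which is the onerous part the comment wants the compiler to take over.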
-
I would prefer to have the ability to specify custom literals as a potential compile time feature for better extensibility. Related proposals: |
-
You also cannot do the same in UTF-16, because
This misunderstanding is one of the worst "pro-UTF-16" arguments because it's fundamentally wrong, and most software which assumes it can index into UTF-16 strings will break for complex or non-Western use cases.
-
@Pzixel there are already It looks like you have a serious misunderstanding about Unicode and UTF. See
@markusschaber Even in Western use cases it'll cause trouble, because emojis are common nowadays and they also lie outside the BMP.
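A short sketch of why indexing by UTF-16 code unit is not indexing by character. It uses U+1F600 as a sample emoji outside the BMP; the variable names are illustrative.

```csharp
using System;
using System.Globalization;
using System.Text;

string emoji = "\U0001F600";                       // U+1F600, outside the BMP

int codeUnits = emoji.Length;                      // counts UTF-16 code units, not characters
bool isPair = char.IsSurrogatePair(emoji, 0);      // emoji[0] and emoji[1] are two halves
int codePoint = char.ConvertToUtf32(emoji, 0);     // the single scalar value they encode
int utf8Bytes = Encoding.UTF8.GetByteCount(emoji); // the same character in UTF-8
int textElements = new StringInfo(emoji).LengthInTextElements;

Console.WriteLine($"{codeUnits} code units, {utf8Bytes} UTF-8 bytes, {textElements} text element, U+{codePoint:X}");
```

Here emoji[0] on its own is half of a surrogate pair, so neither UTF-16 nor UTF-8 gives constant-time access to the n-th character.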
-
@phuclv90 Gah - Emojis. Who needs such modern crap, ASCII art rulez!
-
No, we don't. We have short, int and long. Do you see any numbers there? We are talking about a keyword, not just a struct name.
Sorry, but nothing new there. Maybe I can offer you a better article. @markusschaber I have recently seen emojis in URLs. Unfortunately we'll have to deal with them. Considering RTL, string reversals and so on, it will be real fun.
-
Somehow, Unicode is an overly complex mess. However, we don't have anything better, it's the standard everyone agrees on, and it somehow works (if handled correctly). So let's do it, and let's do it right :-)
-
Summary
Provide a general-purpose and safe way of declaring UTF8String constant values.
Motivation
CoreFX and CoreCLR are expected to get support for UTF8Strings somewhere in the near future (work is currently being done in https://github.com/dotnet/corefxlab, to my knowledge).
When this functionality RTMs, it will not be possible to create or declare a UTF8String in C# without first either getting the raw bytes from some external source (such as returned by a File or Network stream) or by converting from a UTF16 based string.
Detailed Design
The design for this is very similar to the design for #688 in that there are basically two ways this could be supported today: bytearray literals or data declarations. There are many downsides to using bytearray literals, so this proposal only covers data declarations.
Overview (CIL)
This feature is outlined in II.16.3 Embedding data in a PE file.
CIL Grammar
Accessing Data (IL)
Accessing the data is then defined as:
Overview (C#)
A new keyword should likely be provided: utf8string.
It should behave in a manner similar to the string type:
System.UTF8String type
Drawbacks
This would be the preferred mechanism, but comes with the caveat that data declarations don't appear to be supported by any of the major languages today, and may not have extensive testing in the runtime proper.
Alternatives
The IL metadata format currently supports declaring bytearray literals. This functions just as other literals do in that the runtime does not actually do anything with the metadata and it is instead only read and consumed by a higher-level compiler (such as the C# compiler).
The issue with this approach is that the data is not considered directly accessible and would still incur runtime cost to initialize the data before having it passed around.
So, while it would allow users to declare UTF8String literals, they would be barely any more performant than what we have today.