UTF8String Constants #909

tannergooding · 2017-09-15T01:43:40Z

Collaborator

Summary

Provide a general-purpose and safe way to declare UTF8String constant values.

Motivation

CoreFX and CoreCLR are expected to get support for UTF8Strings somewhere in the near future (work is currently being done in https://github.com/dotnet/corefxlab, to my knowledge).

When this functionality RTMs, it will not be possible to create or declare a UTF8String in C# without first either getting the raw bytes from some external source (such as a file or network stream) or converting from a UTF-16-based string.
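As a rough illustration of the second route, the sketch below converts a UTF-16 string literal to UTF-8 bytes at run time using the existing `System.Text.Encoding` API. It is only meant to show the conversion cost the proposal wants to avoid; the sample text is made up for the example.

```csharp
using System;
using System.Text;

// Today: the literal is stored as UTF-16 and transcoded to UTF-8 at run time,
// which allocates a new byte[] every time this line executes.
byte[] utf8 = Encoding.UTF8.GetBytes("h\u00E9llo");

Console.WriteLine(utf8.Length);                 // 6: U+00E9 takes two bytes in UTF-8
Console.WriteLine(BitConverter.ToString(utf8)); // 68-C3-A9-6C-6C-6F
```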

Detailed Design

The design for this is very similar to the design for #688 in that there are basically two ways this could be supported today: bytearray literals or data declarations. There are many downsides to using bytearray literals, so this proposal only covers data declarations.

Overview (CIL)

This feature is outlined in ECMA-335, II.16.3, "Embedding data in a PE file".

CIL Grammar

DataDecl ::= [ DataLabel ‘=’ ] DdBody

DdBody ::= DdItem 
         | ‘{’ DdItemList ‘}’

DdItemList ::= DdItem [ ‘,’ DdItemList ]

DdItem ::= ‘&’ ‘(’ Id ‘)’
         | bytearray ‘(’ Bytes ‘)’
         | char ‘*’ ‘(’ QSTRING ‘)’
         | float32 [ ‘(’ Float64 ‘)’ ] [ ‘[’ Int32 ‘]’ ]
         | float64 [ ‘(’ Float64 ‘)’ ] [ ‘[’ Int32 ‘]’ ]
         | int8 [ ‘(’ Int32 ‘)’ ] [‘[’ Int32 ‘]’ ]
         | int16 [ ‘(’ Int32 ‘)’ ] [ ‘[’ Int32 ‘]’ ]
         | int32 [ ‘(’ Int32 ‘)’ ] [‘[’ Int32 ‘]’ ]
         | int64 [ ‘(’ Int64 ‘)’ ] [ ‘[’ Int32 ‘]’ ]

Accessing Data (IL)

Accessing the data is then defined as:

The data stored in a PE File using the .data directive can be accessed through a static variable, either global or a member of a type, declared at a particular position of the data

FieldDecl ::= FieldAttr* Type Id at DataLabel

The data is then accessed by a program as it would access any other static variable, using instructions such as ldsfld, ldsflda, and so on.

The ability to access data from within the PE File can be subject to platform-specific rules, typically related to section access permissions within the PE File format itself.

Overview (C#)

A new keyword should likely be provided: utf8string.

It should behave in a manner similar to the string type:

represent a sequence of zero or more UTF-8 characters,
be an alias for the System.UTF8String type,
be considered immutable.

Drawbacks

This would be the preferred mechanism, but comes with the caveat that data declarations don't appear to be supported by any of the major languages today, and may not have extensive testing in the runtime proper.

Alternatives

The IL metadata format currently supports declaring bytearray literals. This functions just as other literals do in that the runtime does not actually do anything with the metadata and it is instead only read and consumed by a higher-level compiler (such as the C# compiler).

The issue with this approach is that the data is not considered directly accessible and would still incur runtime cost to initialize the data before having it passed around.

So, while it would allow users to declare UTF8String literals, they would be barely any more performant than what we have today.

Unknown6656 · 2017-09-15T06:32:26Z

~~I would prefer the keyword wstring to utf8string to keep some similarity to C++ and to save a few characters when typing.~~

Also:
#184
#789 (https://github.com/dotnet/csharplang/blob/master/meetings/2017/LDM-2017-07-05.md#utf8-strings)

qrli · 2017-09-15T08:08:08Z

Or the compiler could transparently convert the string literal to UTF-8 when the target type is System.UTF8String.

System.UTF8String u8str = "blabla";

So we don't need any new syntax.

yaakov-h · 2017-09-15T08:32:58Z

@Unknown6656 isn't wstring in C++ UTF-16, not UTF-8? This would be backwards...

Mafii · 2017-09-15T08:39:15Z

I would prefer not to join the chaos of C++ in terms of strings. I'd heavily favor something new (e.g. utf8string) over an already known keyword (like wstring) that might be misunderstood, misinterpreted, or create confusion.

For confusion, see: https://stackoverflow.com/a/402918/5962841

sharwell · 2017-09-15T11:13:41Z
Collaborator

> I would prefer the keyword wstring to utf8string to keep some similarity to C++ and to save a few characters when typing.

@Unknown6656 std::wstring from C++ is already string in C#. UTF-8 strings in C++ would be declared as std::string and you could initialize them with u8"Literal text".

iam3yal · 2017-09-15T11:57:49Z

~~Not sure how people feel about it but maybe u8string as opposed to utf8string? slightly less verbose but I wouldn't mind either way, really. :)~~

In favor of utf8 as opposed to utf8string.

YaakovDavis · 2017-09-15T12:08:01Z

> u8string

Thought about this as well, but decided not to propose it, as I prefer clarity over terseness :)

iam3yal · 2017-09-15T12:23:09Z

@YaakovDavis Yeah me too but I'm not sure whether clarity is an issue here, maybe it is. :D

p.s. utf8 can work too.

If the type is System.UTF8String and the proposed keyword is utf8string I can't really see the point of having a keyword for it but I do for utf8 as it's slightly terser and yet quite clear, even though we have System.String and string and there are more examples for this kind of mapping but I think it was done for "completeness" as a common, built-in types and not really because it was needed so yeah.. torn.

tannergooding · 2017-09-15T14:43:09Z
Collaborator Author

I just proposed utf8string as it was "simple" and "clear".

I like utf8 as well, but wonder if it might be confused for something else (not a string), or if it might be used as a local name somewhere already (at least it seems more likely/common as an existing local name than utf8string).

iam3yal · 2017-09-15T14:48:43Z

@tannergooding Yeah maybe, it's a minor thing, I wouldn't mind it either way but if I had to choose and utf8 is fine then I'd go with it. :)

YaakovDavis · 2017-09-15T14:49:14Z

@eyalsk

> If the type is System.UTF8String and the proposed keyword is utf8string I can't really see the point of having a keyword for it

Assuming you refer to the name length, the same can be said about System.String & string, or Double & double, yet we still have a keyword.

iam3yal · 2017-09-15T14:52:58Z

@YaakovDavis

> If the type is System.UTF8String and the proposed keyword is utf8string I can't really see the point of having a keyword for it

I'll cite myself:

> even though we have System.String and string and there are more examples for this kind of mapping but I think it was done for "completeness" as a common, built-in types and not really because it was needed so yeah.. torn.

> Another option is string8.

This is anything but clear. :)

tannergooding · 2017-09-15T14:53:16Z
Collaborator Author

@YaakovDavis, I had briefly considered string8 as well, but decided against it, both because it could be ambiguous and because it might already be used as the name of a local.

YaakovDavis · 2017-09-15T14:54:39Z

@tannergooding
Yeah, I don't like it either :)

tannergooding · 2017-09-15T14:56:47Z
Collaborator Author

> even though we have System.String and string and there are more examples for this kind of mapping but I think it was done for "completeness" as a common, built-in types and not really because it was needed so yeah.. torn.

@eyalsk, Pushing the shift key slows down my code 😆

Pzixel · 2018-06-05T08:58:22Z

@davidwrighton

> @Pzixel I believe I wasn't clear with my encoding, the intent is to allow any number of bytes to be used in the utf8 string, null is certainly valid embedded in the middle.

Then if I store a single-character string, e.g. "\n", it violates this rule:

> its designed to hold counted length streams of data that are 2 byte aligned.

This string is not 2 byte aligned.

jaredpar · 2018-06-05T16:05:08Z
Maintainer

@Pzixel

> This string is not 2 byte aligned.

This can be worked around by adding a padding byte in the case of odd-length strings. @davidwrighton and I sketched out a basic scheme here that we feel will work just fine.

tannergooding · 2018-06-05T16:08:37Z
Collaborator Author

> While the utf16 string pool seems to indicate that it should only hold UTF16 data, that's actually a should relationship, unlike the strings heap which is required to contain valid utf8 strings, the User String heap is not documented to contain valid utf16 data, and in fact we have quite a few tests that validate behavior outside of containing valid utf16 data.

Does Mono, CoreRT, and the other runtimes also validate and work correctly with this assumption?

Also wondering, how will you differentiate a UTF8 string from a UTF16 string in the heap?

Pzixel · 2018-06-05T16:10:02Z

@jaredpar

> This can be worked around by adding a padding byte in the case of odd length strings. @davidwrighton and I sketched out a basic scheme here we feel will work just fine.

Yes, you can, but now how could you know if it's "\n" with padding zero byte or "\n\0" string without padding?

jaredpar · 2018-06-05T16:13:42Z
Maintainer

@Pzixel

> Yes, you can, but now how could you know if it's "\n" with padding zero byte or "\n\0" string without padding?

You can use a prefix encoding scheme. At worst you end up reserving the first 1-2 bytes for tracking whether it's an odd-length string: for odd-length strings you pay one byte, for even-length strings you pay two bytes. @davidwrighton and I thought we could get it down smaller but didn't have time to dig into the details.
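One way such a prefix scheme could look is sketched below. The exact layout was not spelled out in the thread, so the `Pack`/`Unpack` helpers and the choice of prefix values (0x01 for odd payloads, 0x02 followed by a filler byte for even payloads) are assumptions made purely for illustration. The point is that the stored length is always even and that "\n" and "\n\0" stay distinguishable.

```csharp
using System;
using System.Text;

// Hypothetical layout (an assumption, not the scheme the team settled on):
//   odd payload length  -> [0x01][payload...]        (1 prefix byte)
//   even payload length -> [0x02][0x00][payload...]  (2 prefix bytes)
// The first byte is the number of prefix bytes to skip; the total is always even.
static byte[] Pack(byte[] payload)
{
    int prefix = payload.Length % 2 == 1 ? 1 : 2;
    var packed = new byte[prefix + payload.Length];
    packed[0] = (byte)prefix;
    Array.Copy(payload, 0, packed, prefix, payload.Length);
    return packed;
}

static byte[] Unpack(byte[] packed)
{
    int prefix = packed[0];
    var payload = new byte[packed.Length - prefix];
    Array.Copy(packed, prefix, payload, 0, payload.Length);
    return payload;
}

byte[] a = Pack(Encoding.UTF8.GetBytes("\n"));   // 1-byte payload
byte[] b = Pack(Encoding.UTF8.GetBytes("\n\0")); // 2-byte payload

Console.WriteLine(a.Length);         // 2 (one prefix byte + one payload byte)
Console.WriteLine(b.Length);         // 4 (two prefix bytes + two payload bytes)
Console.WriteLine(Unpack(a).Length); // 1
Console.WriteLine(Unpack(b).Length); // 2
```

Because the payload starts at a recorded offset, no trailing padding is needed, so a trailing zero byte in the data is never ambiguous.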

@tannergooding

> Does Mono, CoreRT, and the other runtimes also validate and work correctly with this assumption?

This doesn't give me much pause. If they don't then we'd need to do the work to get them to support it.

Pzixel · 2018-06-05T16:16:33Z

@jaredpar So it's basically not a string but rather some special struct with paddings/oddness info/... Why then store it in string section? There are sections for raw bytes; why not just use them?

jaredpar · 2018-06-05T16:18:27Z
Maintainer

@Pzixel

> So it's basically not a string but rather some special struct with paddings/oddness info/.

Nope. It's still very much a string. It just potentially changes the offset at which the string begins.

> Why then store it in string section?

@davidwrighton has laid out a good case for this already. It has lots of advantages for the runtime and supporting languages.

davidwrighton · 2018-06-05T17:51:17Z
Collaborator

@tannergooding CoreCLR, desktop CLR, Mono, and CoreRT will all accept strings that look like this. While they aren't actually valid UTF-16 strings, they are valid sequences of UTF-16 16-bit code units (which is a very low bar: a UTF-16 code unit is any 16-bit number), and that's what actually matters to the runtimes. Additionally, I've checked the logic of tools such as ildasm, ilasm, and ilspy, and they all handle these weird cases as well.
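The "any 16-bit number is a code unit" point can be seen from C# itself. The sketch below builds a System.String from a lone surrogate and from two arbitrary bytes packed into one code unit; the little-endian byte order used for packing is an assumption chosen for the example, not something stated in the thread.

```csharp
using System;

// A lone high surrogate is not well-formed UTF-16, yet it is a valid System.String.
string lone = new string((char)0xD800, 1);
Console.WriteLine(lone.Length);                   // 1
Console.WriteLine(char.IsHighSurrogate(lone[0])); // True

// Two arbitrary bytes (0x46, 0x6F) viewed as a single 16-bit code unit.
// Little-endian packing is assumed here for illustration.
char unit = (char)(0x46 | (0x6F << 8));
string packed = new string(unit, 1);
Console.WriteLine(((int)packed[0]).ToString("X4")); // 6F46
```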

mjsabby · 2018-06-17T21:55:33Z

The main argument I can think of against the ldstr approach is that it will require a JIT intrinsic to be efficient, hence a new runtime and therefore delay adoption of such a feature. Perf conscious library authors that want to target .NET Core 2.1 which is (or soon will be) LTS may avoid it.

On the other hand, the C# compiler already supports embedding a data declaration and the necessary optimization to not allocate a compile time constant byte array on the heap via dotnet/roslyn#24621 and expose it as a ReadOnlySpan<byte>.

The compromise I'm thinking of is if we can have the C# compiler support,

ReadOnlySpan<byte> myUtf8String = "Foo";

because there is precedent and a cumbersome way to achieve it,

ReadOnlySpan<byte> myUtf8String = new byte[] { 0x46, 0x6F, 0x6F };

This would make the task of defining these data declarations less onerous for the perf-conscious developer who would like to target the existing runtime and still derive benefit from a newer C# compiler.
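For reference, the existing "cumbersome" pattern can be exercised as in the sketch below. The `FooUtf8` helper name is hypothetical. Whether the array allocation is actually elided depends on the compiler version carrying the optimization mentioned above, so treat that as an assumption rather than a guarantee.

```csharp
using System;
using System.Text;

// A constant byte[] converted directly to ReadOnlySpan<byte>. With the compiler
// optimization referenced above, the span can point at data embedded in the
// assembly instead of a freshly allocated array.
static ReadOnlySpan<byte> FooUtf8() => new byte[] { 0x46, 0x6F, 0x6F };

ReadOnlySpan<byte> foo = FooUtf8();
Console.WriteLine(foo.Length);                   // 3
Console.WriteLine(Encoding.UTF8.GetString(foo)); // Foo
```

The drawback the comment points at is visible here: the bytes have to be written out by hand, and nothing ties them to the text they encode.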

jpierson · 2018-08-22T03:05:32Z

I would prefer to have the ability to specify custom literals as a potential compile time feature for better extensibility.

Related proposals:

markusschaber · 2018-11-27T08:05:24Z

> @kasper3 UTF8 is very different from UTF16. For example, it's quite common to write something like
>
>     const string s = "Hello world";
>     char lastChar = s[s.Length - 1];
>
> You cannot do it in UTF8 because you don't have O(1) access to chars.

You also cannot do the same in UTF-16, because s[x] might be the half of a surrogate pair. And even in the BMP plane case, you only get a codepoint, which does not really equal a character in unicode in general, due to features like compositions, combiners / joiners, ligatures, RTL, ...
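A small sketch of the surrogate-pair case: with a string that ends in a character outside the BMP, indexing the last `char` yields only the low half of the pair. The sample string is made up for the example.

```csharp
using System;

string s = "Hi \U0001F600"; // U+1F600 lies outside the BMP
Console.WriteLine(s.Length); // 5: three BMP chars plus a surrogate pair

char last = s[s.Length - 1];
Console.WriteLine(char.IsLowSurrogate(last)); // True: only half of the character

// Reading the full code point requires starting at the high surrogate.
int codePoint = char.ConvertToUtf32(s, s.Length - 2);
Console.WriteLine(codePoint.ToString("X")); // 1F600
```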

This misunderstanding is one of the worst "pro-utf16" arguments because it's fundamentally wrong, and most software which assumes it can index into UTF-16 strings will break for complex or non-western use cases.

phuclv90 · 2019-06-06T02:48:08Z

> I don't like digits in the keyword anyway. We don't have any now so I'm not sure we want introduce open this bottle. Something like utfstring could be ok.

@Pzixel there are already System.Int16, System.Int32 and System.Int64. utfstring is more confusing and incorrect.

It looks like you have a serious misunderstanding about Unicode and UTF. See

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Is UTF-16 fixed-width or variable-width? Why doesn't UTF-8 have byte-order problem?

@markusschaber even in western use cases it'll cause trouble because emojis are common nowadays and they also lie outside the BMP

markusschaber · 2019-06-06T06:38:41Z

@phuclv90 Gah - Emojis. Who needs such modern crap, ASCII art rulez! ;-)

Pzixel · 2019-06-06T08:31:41Z

@phuclv90

> there are already System.Int16, System.Int32 and System.Int64. utfstring is more confusing and incorrect.

No, we don't. We have short, int and long. Do you see any numbers there? We are talking about a keyword, not just a struct name.

> It looks like you have a serious misunderstanding about Unicode and UTF. See

Sorry, but nothing new there. Maybe I can offer a better article to you.

@markusschaber I have recently seen emojis in URLs. Unfortunately we'll have to deal with them. Considering RTL, string reversals and so on, it will be real fun.

markusschaber · 2019-06-06T09:15:06Z

Somehow, Unicode is an overly complex mess. However, we don't have anything better, it's the standard that everyone agrees on, and it somehow works (if handled correctly).

So let's do it, and let's do it right :-)
