Invalid UTF-8 literal strings with leading surrogate character at end of string #10973

ScottPJones · 2015-04-24T02:31:38Z

According to the comments in issue #10 by @StefanKarpinski,
"\ud800" should give an error, since that is an invalid UTF-8 string:

See this thread. This changes the behavior for bare string literals previously described in #4 to be the following:

all bytes < 0x80: ASCIIString
has bytes ≥ 0x80 and is valid UTF-8: UTF8String
invalid UTF-8: throws an error
The b"..." string form (see #11) will let you use string syntax with \x and \u to make byte arrays. If you want to make a UTF-8 string that contains invalid UTF-8, you can do something like this:

UTF8String(b"\xff\xff")
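To make the three-way rule quoted above concrete, here is a rough sketch of the classification, written against current Julia (1.x) rather than the 2015 code base. The name `classify_literal` is hypothetical, and the returned symbols only stand in for the old ASCIIString and UTF8String types; this is not the parser's actual code.

```julia
# Hypothetical helper for illustration; assumes Julia 1.x.
# Symbols stand in for the 2015-era ASCIIString and UTF8String types.
function classify_literal(bytes::Vector{UInt8})
    # Rule 1: every byte below 0x80 means the literal is pure ASCII.
    all(b -> b < 0x80, bytes) && return :ASCIIString
    # Rule 2: non-ASCII bytes are accepted only if they form valid UTF-8.
    isvalid(String, bytes) && return :UTF8String
    # Rule 3: anything else is rejected with an error.
    throw(ArgumentError("invalid UTF-8 in string literal"))
end

classify_literal(UInt8[0x61, 0x62])   # pure ASCII bytes
classify_literal(UInt8[0xc3, 0xa9])   # "é" encoded as UTF-8
```

Under these assumptions the bytes 0xff 0xff fall through to the third rule and raise an error, which is why the b"..." route is needed to build such a string on purpose.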

jiahao · 2015-04-24T02:51:41Z

related: #10274

ScottPJones · 2015-04-24T14:36:01Z

I've been running into further inconsistencies and bugs with character and string literals, and would appreciate some guidance:
is_valid_utf8("\ud800") -> true
is_valid_utf8("\udc00") -> true

I would consider both of these to be bugs: you shouldn't have lone surrogate characters in UTF-8,
and characters > 0xffff should only ever be produced as valid 4-byte encodings.
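For reference, a minimal sketch of the kind of strict check being asked for, assuming Julia 1.x and 1-based byte vectors. The function name `strict_utf8_valid` is hypothetical; this is not the code that was later merged, just one straightforward way to reject surrogates, overlong forms, and truncated sequences.

```julia
# Hypothetical validator for illustration; assumes 1-based indexing.
function strict_utf8_valid(bytes::AbstractVector{UInt8})
    i = 1
    n = length(bytes)
    while i <= n
        b = bytes[i]
        if b < 0x80                      # single-byte (ASCII) character
            i += 1
            continue
        elseif (b & 0xe0) == 0xc0        # lead byte of a 2-byte sequence
            ncont, cp, mincp = 1, UInt32(b & 0x1f), 0x00000080
        elseif (b & 0xf0) == 0xe0        # lead byte of a 3-byte sequence
            ncont, cp, mincp = 2, UInt32(b & 0x0f), 0x00000800
        elseif (b & 0xf8) == 0xf0        # lead byte of a 4-byte sequence
            ncont, cp, mincp = 3, UInt32(b & 0x07), 0x00010000
        else
            return false                 # stray continuation or invalid lead byte
        end
        (i + ncont <= n) || return false # sequence is cut off at end of input
        for k in 1:ncont
            c = bytes[i + k]
            ((c & 0xc0) == 0x80) || return false   # must be 10xxxxxx
            cp = (cp << 6) | UInt32(c & 0x3f)
        end
        (cp >= mincp) || return false              # overlong encoding
        (0xd800 <= cp <= 0xdfff) && return false   # surrogate code point
        (cp <= 0x10ffff) || return false           # beyond the Unicode range
        i += ncont + 1
    end
    return true
end

strict_utf8_valid(UInt8[0xed, 0xa0, 0x80])        # bytes a literal "\ud800" turns into
strict_utf8_valid(UInt8[0xf0, 0x90, 0x80, 0x80])  # U+10000 as a 4-byte sequence
```

With this check the first call is rejected, because the three bytes decode to U+D800, while the second is accepted as a well-formed 4-byte encoding.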

StefanKarpinski · 2015-04-24T15:09:45Z

We completely ignored surrogate pair issues when writing this code. Should ideally be fixed if it can be done without wrecking performance, but I'm not entirely sure what major trouble this causes.

ScottPJones · 2015-04-24T23:06:26Z

Well, in the conversion code I've been writing, I'm getting it 3-60x faster so far (and that's only trying short strings and 64K strings, not anything really large), and they all check for invalid surrogate pairs... So I do think it can be done without sacrificing performance!
Hopefully I'll have it ready for a pull request this weekend...

Fix #11141/#10973 and improve performance of is_valid_utf8/is_valid_ascii

ScottPJones · 2015-07-28T21:36:47Z

Closed by #11203

jiahao added the unicode Related to unicode characters and encodings label Apr 24, 2015

stevengj added a commit that referenced this issue May 15, 2015

Merge pull request #11203 from ScottPJones/spj/fixvalidutf8

2f019d7

Fix #11141/#10973 and improve performance of is_valid_utf8/is_valid_ascii

ScottPJones closed this as completed Jul 28, 2015
