Invalid UTF-8 literal strings with leading surrogate character at end of string #10973

Closed
ScottPJones opened this issue Apr 24, 2015 · 5 comments
Labels
unicode Related to unicode characters and encodings

Comments

@ScottPJones
Contributor

According to the comments in issue #10 by @StefanKarpinski,
"\ud800" should give an error, since that is an invalid UTF-8 string:

See this thread. This changes the behavior for bare string literals previously described in #4 to be the following:

all bytes < 0x80: ASCIIString
has bytes ≥ 0x80 and is valid UTF-8: UTF8String
invalid UTF-8: throws an error
The b"..." string form (see #11) will let you use string syntax with \x and \u to make byte arrays. If you want to make a UTF-8 string that contains invalid UTF-8, you can do something like this:

UTF8String(b"\xff\xff")
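
The three-way rule quoted above can be sketched outside Julia. The following is a minimal illustration in Python, not Julia's implementation: `classify_literal` is a hypothetical helper name, and it assumes that Python's strict UTF-8 decoder (which rejects lone surrogates, overlong forms, and stray bytes such as 0xff) is an acceptable stand-in for "is valid UTF-8".

```python
def classify_literal(data: bytes) -> str:
    """Return the type name the quoted rule would assign to these bytes.

    Hypothetical helper for illustration; the names "ASCIIString" and
    "UTF8String" are taken from the quoted comment.
    """
    # Rule 1: all bytes < 0x80 -> ASCIIString
    if all(b < 0x80 for b in data):
        return "ASCIIString"
    # Rule 3: invalid UTF-8 -> error. Python's strict decoder raises
    # UnicodeDecodeError for invalid sequences, including encoded surrogates.
    data.decode("utf-8")
    # Rule 2: has bytes >= 0x80 and is valid UTF-8 -> UTF8String
    return "UTF8String"


print(classify_literal(b"hello"))
print(classify_literal("h\u00e9llo".encode("utf-8")))
```

Under this rule the three bytes that "\ud800" would naively encode to (ed a0 80) fall into the error case, which is the behavior this issue asks for.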

@jiahao
Member

jiahao commented Apr 24, 2015

related: #10274

jiahao added the unicode label (Related to unicode characters and encodings) on Apr 24, 2015
@ScottPJones
Contributor Author

I've been running into further inconsistencies and bugs with character and string literals, and would appreciate some guidance:
is_valid_utf8("\ud800") -> true
is_valid_utf8("\udc00") -> true

I would consider both of these to be bugs: a lone surrogate should never appear in UTF-8,
and a surrogate pair should only ever be stored as the valid 4-byte encoding of the character > 0xffff that it represents.
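
To make the byte patterns concrete, here is a small Python sketch (an illustration, not the Julia code under discussion). Python's `surrogatepass` error handler is used only to produce the bytes a naive encoder would emit for a lone surrogate; `has_encoded_surrogate` is a hypothetical helper that looks for the lead byte 0xED followed by a byte in 0xA0..0xBF, which is how U+D800..U+DFFF would appear in a 3-byte form.

```python
# Bytes a naive encoder emits for lone surrogates (not valid UTF-8).
lone_high = "\ud800".encode("utf-8", "surrogatepass")  # ed a0 80
lone_low = "\udc00".encode("utf-8", "surrogatepass")   # ed b0 80

# A character above 0xffff gets a proper 4-byte encoding instead.
four_byte = "\U00010000".encode("utf-8")               # f0 90 80 80


def has_encoded_surrogate(data: bytes) -> bool:
    """Hypothetical check: True if a 3-byte surrogate encoding is present."""
    return any(
        data[i] == 0xED and 0xA0 <= data[i + 1] <= 0xBF
        for i in range(len(data) - 1)
    )


print(lone_high.hex(), has_encoded_surrogate(lone_high))
print(four_byte.hex(), has_encoded_surrogate(four_byte))
```

A validity check that includes this test would report false for both literals above rather than true.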

@StefanKarpinski
Sponsor Member

We completely ignored surrogate pair issues when writing this code. Should ideally be fixed if it can be done without wrecking performance, but I'm not entirely sure what major trouble this causes.

@ScottPJones
Contributor Author

Well, in the conversion code I've been writing, I'm getting it 3-60x faster so far (and that's only trying short strings and 64K strings, not anything really large), and they all check for invalid surrogate pairs... So I do think it can be done without sacrificing performance!
Hopefully I'll have it ready for a pull request this weekend...
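
For readers unfamiliar with what "checking for invalid surrogate pairs" involves during conversion, the idea can be sketched as follows. This is an illustrative Python sketch only; it is not the conversion code mentioned in this comment, says nothing about its performance, and `utf16_units_to_utf8` is a hypothetical name. It assumes the input is a list of 16-bit code units.

```python
def utf16_units_to_utf8(units):
    """Convert UTF-16 code units to UTF-8 bytes, rejecting bad surrogates."""
    out = bytearray()
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # A leading surrogate must be followed by a trailing surrogate.
            if i + 1 >= len(units) or not (0xDC00 <= units[i + 1] <= 0xDFFF):
                raise ValueError(f"unpaired leading surrogate at index {i}")
            # Combine the pair into one code point above 0xffff.
            cp = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        elif 0xDC00 <= u <= 0xDFFF:
            # A trailing surrogate with no leading surrogate before it.
            raise ValueError(f"unpaired trailing surrogate at index {i}")
        else:
            cp = u
            i += 1
        out += chr(cp).encode("utf-8")
    return bytes(out)


# The pair D83D DE00 combines to U+1F600 and becomes one 4-byte sequence.
print(utf16_units_to_utf8([0x0041, 0xD83D, 0xDE00]).hex())
```

The surrogate check is a couple of range comparisons per code unit, so including it does not change the overall shape of the conversion loop.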

stevengj added a commit that referenced this issue May 15, 2015
Fix #11141/#10973 and improve performance of is_valid_utf8/is_valid_ascii
@ScottPJones
Contributor Author

Closed by #11203


3 participants