Escaped unicode characters are encoded incorrectly #338
Comments
I can reproduce in db7374b and confirm it works in C#:
works. So the bug is in the lexer/parser and not in the conversion to uint16.
works too.
I can also confirm that Lexhelp.unicodeGraphShort is called with the correct value and returns the correct value.
Seems I'm stuck for now. Any ideas where to look further?
The bug is related to the behaviour of the .NET library function This
returns 65533. Note that D800 to DFFF are not actually valid UTF-16 characters, see http://en.wikipedia.org/wiki/UTF-16. So it is not entirely unreasonable that this should give odd results. It actually seems like the F# compiler should give an error on this particular Unicode literal.
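The replacement behaviour described above can be illustrated outside .NET. This is a minimal Python sketch, not the F# compiler's code: it assumes Python's UTF-16 decoder with `errors="replace"` as a stand-in for an encoding API that substitutes U+FFFD for an unpaired surrogate. The helper name `decode_code_unit` is made up for the illustration.

```python
# Illustration only: Python's codec stands in for a .NET encoding API.
# A lone surrogate code unit is not well-formed UTF-16, so a decoder
# configured to replace errors substitutes U+FFFD (65533 decimal).

def decode_code_unit(unit: int) -> str:
    """Decode a single UTF-16 code unit, replacing malformed input."""
    raw = unit.to_bytes(2, "little")
    return raw.decode("utf-16-le", errors="replace")

ordinary = decode_code_unit(0xABCD)   # a normal BMP character survives
lone_low = decode_code_unit(0xDC00)   # unpaired low surrogate is replaced
lone_high = decode_code_unit(0xD800)  # unpaired high surrogate is replaced

print(hex(ord(ordinary)))
print(ord(lone_low), ord(lone_high))
```

Any value in the D800 to DFFF range that passes through such an encoding step on its own comes out as the replacement character, which matches the symptom reported in this issue.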
@dsyme but strings containing these characters can be constructed by using a surrogate pair.
Therefore, doesn't it make sense for the compiler to allow using them in literals directly? Otherwise, it becomes very difficult to work with such "invalid" strings, as the OP on SO discovered.
Though, after some more thinking, I guess an error message explaining why this is invalid would be almost as useful.
I created a PR which throws an error in #339.
I don't think issuing an error on all codepoints in this range is the right solution.
So the current behavior deviates from the F# spec, and aligning with the spec means having the same behavior as C#.
The alternative is to change the F# spec so that string literals must represent well-formed Unicode strings. If this is the case, we still can't eagerly error out as soon as we find half of a surrogate pair, since the next specified codepoint might be the proper remainder of that pair.
FWIW I think the C# behavior is best: having the ability to create malformed Unicode strings (which are still perfectly valid .NET strings) is useful, e.g. in security-related tools. And you can create them anyways, e.g.
The relevant bit of the C# parser is here; they basically just parse out the specified hex value as an int (or pair of uints), then cast to char. No calls to encoding APIs.
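The "parse the hex value and cast, no encoding APIs" approach can be sketched as follows. This is a hedged Python illustration, not the C# or F# lexer: the names `SHORT_ESCAPE` and `lex_short_escapes` are hypothetical, UTF-16 code units are modelled as plain integers, and unescaped characters are assumed to lie in the Basic Multilingual Plane.

```python
import re
from typing import List

# Hypothetical pattern for a short escape: backslash, 'u', four hex digits.
SHORT_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")

def lex_short_escapes(literal: str) -> List[int]:
    """Convert a literal's source text into a list of UTF-16 code units."""
    units: List[int] = []
    pos = 0
    while pos < len(literal):
        match = SHORT_ESCAPE.match(literal, pos)
        if match:
            # Take the hex value as the code unit directly; no encoder is
            # involved, so surrogate halves pass through unchanged.
            units.append(int(match.group(1), 16))
            pos = match.end()
        else:
            # Assumption: ordinary characters fit in one code unit (BMP).
            units.append(ord(literal[pos]))
            pos += 1
    return units

print([hex(u) for u in lex_short_escapes(r"a\uD800b")])
```

Because each escape is handled independently, a lone high surrogate and a high surrogate followed by its low half are both accepted, which avoids the "can't eagerly error out" problem mentioned above.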
The issue is that we use a ByteBuffer to collect the data. I think we would need something like a CharBuffer to make that work.
fixes dotnet#338
Changes lexing of unicode escape sequences to match the F# spec (which says things should work the same as C#).
- For short escape sequences, directly encode the hex value into a char
- For long escape sequences, validate that the total codepoint is <= 0x0010FFFF
  - If it is, follow same logic as before (which was correct)
  - If it isn't, issue an error (same as C#)
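The long-escape rule in the list above can be sketched like this. It is a Python illustration under stated assumptions rather than the code from the PR: `encode_long_escape` and `MAX_CODE_POINT` are hypothetical names, and "same logic as before" is assumed here to mean the standard UTF-16 split of a supplementary code point into a surrogate pair.

```python
from typing import List

MAX_CODE_POINT = 0x0010FFFF  # upper bound named in the PR description

def encode_long_escape(hex_digits: str) -> List[int]:
    """Turn the eight hex digits of a long escape into UTF-16 code units."""
    code_point = int(hex_digits, 16)
    if code_point > MAX_CODE_POINT:
        # Out-of-range values are rejected instead of being encoded.
        raise ValueError(f"code point 0x{code_point:08X} is out of range")
    if code_point <= 0xFFFF:
        # Fits in a single code unit.
        return [code_point]
    # Standard UTF-16 surrogate pair computation (assumed, see lead-in).
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)
    low = 0xDC00 + (offset & 0x3FF)
    return [high, low]

print([hex(u) for u in encode_long_escape("0001F600")])
```

For example, the digits `0001F600` yield the pair D83D, DE00, while `00110000` raises an error because it exceeds the permitted maximum.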
In strings with escaped Unicode characters (i.e. "\uABCD"), characters in the range from D800 to DFFF always get encoded as FFFD (65533 decimal).
Here's from the F# Interactive:
Here's an SO question that prompted me to fiddle with it: http://stackoverflow.com/questions/29359408/surrogate-pair-detection-fails
Here's another issue I've (mistakenly?) opened in the other repo: fsharp/fsharp#399