Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Escaped unicode characters are encoded incorrectly #338

Closed
fsoikin opened this issue Mar 31, 2015 · 12 comments
Closed

Escaped unicode characters are encoded incorrectly #338

fsoikin opened this issue Mar 31, 2015 · 12 comments
Labels

Comments

@fsoikin
Copy link

fsoikin commented Mar 31, 2015

In strings with espaced Unicode characters (i.e. "\uABCD"), characters in the range from D800 to DFFF always get encoded as FFFD (65533 decimal).

Here's from the F# Interactive:

    > "\uD500".[0] |> uint16 ;;
    val it : uint16 = 54528us
    > "\uD700".[0] |> uint16 ;;
    val it : uint16 = 55040us
    > "\uD800".[0] |> uint16 ;;
    val it : uint16 = 65533us
    > "\uD900".[0] |> uint16 ;;
    val it : uint16 = 65533us
    > "\uE000".[0] |> uint16 ;;
    val it : uint16 = 57344us

Here's an SO question that prompted me to fiddle with it: http://stackoverflow.com/questions/29359408/surrogate-pair-detection-fails

Here's another issue I've (mistakenly?) opened in the other repo: fsharp/fsharp#399

@forki
Copy link
Contributor

forki commented Mar 31, 2015

what

forki added a commit to forki/visualfsharp that referenced this issue Mar 31, 2015
forki added a commit to forki/visualfsharp that referenced this issue Mar 31, 2015
@forki
Copy link
Contributor

forki commented Mar 31, 2015

I can reproduce in db7374b and confirm it works in C#:

image

@forki
Copy link
Contributor

forki commented Mar 31, 2015

[<Test>]
member this.``Can encode all unicode characters``()=
    for i in [1us..65533us] do
        let s = (i |> char).ToString()
        let r = s.[0] |> uint16
        Assert.AreEqual(i, r)

works.

So the bug is in lexer/parser and not conversion to uint16

forki added a commit to forki/visualfsharp that referenced this issue Mar 31, 2015
@forki
Copy link
Contributor

forki commented Mar 31, 2015

Assert.AreEqual(55296us, Lexhelp.unicodeGraphShort("D800"))

works too.

@latkin latkin added the Bug label Mar 31, 2015
@forki
Copy link
Contributor

forki commented Mar 31, 2015

I can also confirm that Lexhelp.unicodeGraphShort is called with the correct value and returns the correct value.

forki added a commit to forki/visualfsharp that referenced this issue Mar 31, 2015
@forki
Copy link
Contributor

forki commented Mar 31, 2015

seems I'm stuck for now. any ideas where to look further?

@dsyme
Copy link
Contributor

dsyme commented Apr 1, 2015

The bug is related to the behaviour of the .NET library function System.Text.Encoding.Unicode.GetString, used in the "string finisher" code in lexhelp.fs.

This

System.Text.Encoding.Unicode.GetString([| 0uy; 216uy |],0,2).[0] |> int32

returns 65533.

Note that D800 to DFFF are not actually valid UTF-16 characters, see http://en.wikipedia.org/wiki/UTF-16. So it is not entirely unreasonable that this should give odd results. It actually seems like the F# compiler should give an error on this particular Unicode literal.

@fsoikin
Copy link
Author

fsoikin commented Apr 1, 2015

@dsyme but strings containing these characters can be constructed by using a surrogate pair.
For example:

> let s = "𠮷".Substring(0,1)
let len = s.Length
let code = s.[0] |> uint16
;;

val s : string = "�"
val len : int = 1
val code : uint16 = 55362us

Therefore, doesn't it make sense for the compiler to allow using them in literals directly? Otherwise, it becomes very difficult to work with such "invalid" strings, as the OP on SO discovered.

@fsoikin
Copy link
Author

fsoikin commented Apr 1, 2015

Though, after some more thinking, I guess an error message explaining why this is invalid would be almost as useful.

forki added a commit to forki/visualfsharp that referenced this issue Apr 1, 2015
forki added a commit to forki/visualfsharp that referenced this issue Apr 1, 2015
forki added a commit to forki/visualfsharp that referenced this issue Apr 1, 2015
forki added a commit to forki/visualfsharp that referenced this issue Apr 1, 2015
forki added a commit to forki/visualfsharp that referenced this issue Apr 1, 2015
forki added a commit to forki/visualfsharp that referenced this issue Apr 1, 2015
forki added a commit to forki/visualfsharp that referenced this issue Apr 1, 2015
@forki
Copy link
Contributor

forki commented Apr 1, 2015

I created a PR which throws an error in #339.
Now I wonder if we should create a roslyn issue, since C# compiler accepts these chars and I think C#/VB and F# should behave similarly here.

@latkin
Copy link
Contributor

latkin commented Apr 1, 2015

I don't think issuing an error on all codepoints in this range is the right solution.

  • C#, per spec, explicitly allows any Unicode character in a string literal, even if the result is a malformed Unicode string
  • F# spec calls out explicitly (sec 3.5, emphasis mine)

you can directly embed the following in a string in the same way as in C#: ... Unicode characters, such as "\u0041bc"

So current behavior deviates from the F# spec, and aligning with the spec means having the same behavior as C#.

The alternative is to change F# spec so that string literals must now represent well-formed Unicode strings. If this is the case, we still can't eagerly error out as soon as we find half of a surrogate pair, since the next specified codepoint might be the proper remainder of that pair.

FWIW I think C# behavior is best - having the ability to create malformed Unicode strings (which are still perfectly valid .NET strings) is useful, e.g. in security-related tools. And you can create them anyways, e.g. new System.String( [| char 0xD800 |] ). F# seems to encode them without problems today in char literals, too.

Relevant bit of C# parser is here, they basically just parse out the specified hex value as an int (or pair of uints), then cast to char. No calls to encoding APIs.

@forki
Copy link
Contributor

forki commented Apr 2, 2015

Relevant bit of C# parser is here, they basically just parse out the specified hex value as an int (or pair of uints), then cast to char. No calls to encoding APIs.

The issue is that we use a ByteBuffer to collect the data. I think we would need something like a CharBuffer to make that work.

latkin added a commit to latkin/visualfsharp that referenced this issue Apr 8, 2015
fixes dotnet#338

Changes lexing of unicode escape sequences to match the F# spec (which says things should work the same as C#).

- For short escape sequences, directly encode the hex value into a char
- For long escape sequences, validate that the total codepoint is <= 0x0010FFFF
  - If it is, follow same logic as before (which was correct)
  - If it isn't, issue an error (same as C#)
@latkin latkin closed this as completed in 52e6e03 Apr 15, 2015
@latkin latkin added the fixed label Apr 15, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants