Escaped unicode characters are encoded incorrectly #338

fsoikin · 2015-03-31T13:54:16Z

In strings with espaced Unicode characters (i.e. "\uABCD"), characters in the range from D800 to DFFF always get encoded as FFFD (65533 decimal).

Here's from the F# Interactive:

    > "\uD500".[0] |> uint16 ;;
    val it : uint16 = 54528us
    > "\uD700".[0] |> uint16 ;;
    val it : uint16 = 55040us
    > "\uD800".[0] |> uint16 ;;
    val it : uint16 = 65533us
    > "\uD900".[0] |> uint16 ;;
    val it : uint16 = 65533us
    > "\uE000".[0] |> uint16 ;;
    val it : uint16 = 57344us

Here's an SO question that prompted me to fiddle with it: http://stackoverflow.com/questions/29359408/surrogate-pair-detection-fails

Here's another issue I've (mistakenly?) opened in the other repo: fsharp/fsharp#399

The text was updated successfully, but these errors were encountered:

forki · 2015-03-31T14:59:07Z

…nces dotnet#338

forki · 2015-03-31T15:26:23Z

I can reproduce in db7374b and confirm it works in C#:

forki · 2015-03-31T15:52:40Z

[<Test>]
member this.``Can encode all unicode characters``()=
    for i in [1us..65533us] do
        let s = (i |> char).ToString()
        let r = s.[0] |> uint16
        Assert.AreEqual(i, r)

works.

So the bug is in lexer/parser and not conversion to uint16

…#338

forki · 2015-03-31T16:26:10Z

Assert.AreEqual(55296us, Lexhelp.unicodeGraphShort("D800"))

works too.

forki · 2015-03-31T17:49:08Z

I can also confirm that Lexhelp.unicodeGraphShort is called with the correct value and returns the correct value.

forki · 2015-03-31T18:06:17Z

seems I'm stuck for now. any ideas where to look further?

dsyme · 2015-04-01T00:51:51Z

The bug is related to the behaviour of the .NET library function System.Text.Encoding.Unicode.GetString, used in the "string finisher" code in lexhelp.fs.

This

System.Text.Encoding.Unicode.GetString([| 0uy; 216uy |],0,2).[0] |> int32

returns 65533.

Note that D800 to DFFF are not actually valid UTF-16 characters, see http://en.wikipedia.org/wiki/UTF-16. So it is not entirely unreasonable that this should give odd results. It actually seems like the F# compiler should give an error on this particular Unicode literal.

fsoikin · 2015-04-01T01:00:28Z

@dsyme but strings containing these characters can be constructed by using a surrogate pair.
For example:

> let s = "𠮷".Substring(0,1)
let len = s.Length
let code = s.[0] |> uint16
;;

val s : string = "�"
val len : int = 1
val code : uint16 = 55362us

Therefore, doesn't it make sense for the compiler to allow using them in literals directly? Otherwise, it becomes very difficult to work with such "invalid" strings, as the OP on SO discovered.

fsoikin · 2015-04-01T01:03:52Z

Though, after some more thinking, I guess an error message explaining why this is invalid would be almost as useful.

forki · 2015-04-01T09:17:48Z

I created a PR which throws an error in #339.
Now I wonder if we should create a roslyn issue, since C# compiler accepts these chars and I think C#/VB and F# should behave similarly here.

latkin · 2015-04-01T21:16:26Z

I don't think issuing an error on all codepoints in this range is the right solution.

C#, per spec, explicitly allows any Unicode character in a string literal, even if the result is a malformed Unicode string
F# spec calls out explicitly (sec 3.5, emphasis mine)

you can directly embed the following in a string in the same way as in C#: ... Unicode characters, such as "\u0041bc"

So current behavior deviates from the F# spec, and aligning with the spec means having the same behavior as C#.

The alternative is to change F# spec so that string literals must now represent well-formed Unicode strings. If this is the case, we still can't eagerly error out as soon as we find half of a surrogate pair, since the next specified codepoint might be the proper remainder of that pair.

FWIW I think C# behavior is best - having the ability to create malformed Unicode strings (which are still perfectly valid .NET strings) is useful, e.g. in security-related tools. And you can create them anyways, e.g. new System.String( [| char 0xD800 |] ). F# seems to encode them without problems today in char literals, too.

Relevant bit of C# parser is here, they basically just parse out the specified hex value as an int (or pair of uints), then cast to char. No calls to encoding APIs.

forki · 2015-04-02T06:52:48Z

Relevant bit of C# parser is here, they basically just parse out the specified hex value as an int (or pair of uints), then cast to char. No calls to encoding APIs.

The issue is that we use a ByteBuffer to collect the data. I think we would need something like a CharBuffer to make that work.

fixes dotnet#338 Changes lexing of unicode escape sequences to match the F# spec (which says things should work the same as C#). - For short escape sequences, directly encode the hex value into a char - For long escape sequences, validate that the total codepoint is <= 0x0010FFFF - If it is, follow same logic as before (which was correct) - If it isn't, issue an error (same as C#)

fsoikin mentioned this issue Mar 31, 2015

Escaped unicode characters are incorrectly decoded fsharp/fsharp#399

Closed

forki added a commit to forki/visualfsharp that referenced this issue Mar 31, 2015

Show that escaped unicode characters are encoded incorrectly - refere…

398dbcf

…nces dotnet#338

forki added a commit to forki/visualfsharp that referenced this issue Mar 31, 2015

Show that escaped unicode characters are encoded incorrectly - refere…

db7374b

…nces dotnet#338

forki added a commit to forki/visualfsharp that referenced this issue Mar 31, 2015

Show that conversion from unicode to uint16 works - references dotnet…

1b68e6f

…#338

latkin added the Bug label Mar 31, 2015

forki added a commit to forki/visualfsharp that referenced this issue Mar 31, 2015

Show that unicodeGraphShort works - references dotnet#338

9826335

forki added a commit to forki/visualfsharp that referenced this issue Apr 1, 2015

Throw error if unicode is in reserved space - fixes dotnet#338

21c7c44

forki mentioned this issue Apr 1, 2015

Throw error if unicode is in reserved space #339

Closed

forki added a commit to forki/visualfsharp that referenced this issue Apr 1, 2015

Throw error if unicode is in reserved space - fixes dotnet#338

be21d0c

forki added a commit to forki/visualfsharp that referenced this issue Apr 1, 2015

Throw error if unicode is in reserved space - fixes dotnet#338

e68de59

forki added a commit to forki/visualfsharp that referenced this issue Apr 1, 2015

Throw error if unicode is in reserved space - fixes dotnet#338

905e4d2

forki added a commit to forki/visualfsharp that referenced this issue Apr 1, 2015

Throw error if unicode is in reserved space - fixes dotnet#338

5f3a5fe

forki added a commit to forki/visualfsharp that referenced this issue Apr 1, 2015

Throw error if unicode is in reserved space - fixes dotnet#338

389047f

forki added a commit to forki/visualfsharp that referenced this issue Apr 1, 2015

Throw error if unicode is in reserved space - fixes dotnet#338

e7d64ae

latkin mentioned this issue Apr 8, 2015

Fix lexing for unicode escape sequences #348

Closed

latkin closed this as completed in 52e6e03 Apr 15, 2015

latkin added the fixed label Apr 15, 2015

abelbraaksma mentioned this issue Jul 24, 2018

Invalid chars in string treated as equal in C# and unequal in F# #5371

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Escaped unicode characters are encoded incorrectly #338

Escaped unicode characters are encoded incorrectly #338

fsoikin commented Mar 31, 2015

forki commented Mar 31, 2015

forki commented Mar 31, 2015

forki commented Mar 31, 2015

forki commented Mar 31, 2015

forki commented Mar 31, 2015

forki commented Mar 31, 2015

dsyme commented Apr 1, 2015

fsoikin commented Apr 1, 2015

fsoikin commented Apr 1, 2015

forki commented Apr 1, 2015

latkin commented Apr 1, 2015

forki commented Apr 2, 2015

Escaped unicode characters are encoded incorrectly #338

Escaped unicode characters are encoded incorrectly #338

Comments

fsoikin commented Mar 31, 2015

forki commented Mar 31, 2015

forki commented Mar 31, 2015

forki commented Mar 31, 2015

forki commented Mar 31, 2015

forki commented Mar 31, 2015

forki commented Mar 31, 2015

dsyme commented Apr 1, 2015

fsoikin commented Apr 1, 2015

fsoikin commented Apr 1, 2015

forki commented Apr 1, 2015

latkin commented Apr 1, 2015

forki commented Apr 2, 2015