Rethink string representation #68

fstirlitz · 2019-08-17T19:04:24Z

(cribbed from README.md)

Unlike strings in JavaScript, Lua strings are not Unicode strings, but bytestrings (sequences of 8-bit values); likewise, implementations of Lua parse the source code as a sequence of octets. However, the input to this parser is a JavaScript string, i.e. a sequence of 16-bit code units (not necessarily well-formed UTF-16). This poses a problem of how those code units should be interpreted, particularly if they are outside the Basic Latin block ('ASCII').

Currently, this parser handles Unicode input by encoding it in WTF-8 and reinterpreting the resulting code units as Unicode code points. This applies to string literals and (if extendedIdentifiers is enabled) to identifiers as well. Lua byte escapes inside string literals are interpreted directly as code points, while Lua 5.3 \u{} escapes are similarly encoded as UTF-8 code units, which are then reinterpreted as code points. It is as if the parser input were being interpreted as ISO-8859-1 while actually being encoded in UTF-8.
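To make the mapping concrete, here is a minimal sketch of "encode as WTF-8, then treat every byte as a code point". The function name toPseudoLatin1 is made up for illustration and is not part of the parser's API; the parser's actual implementation may differ.

```javascript
// Hypothetical helper (not the parser's API): encode the input as WTF-8 and
// store each resulting byte as one UTF-16 code unit in the range 0..255.
function toPseudoLatin1(input) {
  let out = '';
  // for...of walks the string by code point; a lone surrogate is yielded as-is.
  for (const ch of input) {
    const cp = ch.codePointAt(0);
    if (cp < 0x80) {
      out += String.fromCharCode(cp);
    } else if (cp < 0x800) {
      out += String.fromCharCode(0xc0 | (cp >> 6), 0x80 | (cp & 0x3f));
    } else if (cp < 0x10000) {
      // Lone surrogates also land here: WTF-8 encodes them as three bytes.
      out += String.fromCharCode(
        0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f));
    } else {
      out += String.fromCharCode(
        0xf0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3f),
        0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f));
    }
  }
  return out;
}
```

With this sketch, toPseudoLatin1('💩') yields the four code units 0xF0 0x9F 0x92 0xA9, the same sequence that the Lua literal '\240\159\146\169' spells out byte by byte.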

This ensures that no otherwise-valid input will be rejected due to encoding errors. Assuming the input was originally encoded in UTF-8 (which includes the case of only containing ASCII characters), it also preserves the following properties:

- String literals (and identifiers, if extendedIdentifiers is enabled) will have the same representation in the AST if and only if they represent the same string in the source code: e.g. the Lua literals '💩', '\u{1f4a9}' and '\240\159\146\169' will all have "\u00f0\u009f\u0092\u00a9" in their .value property, and likewise local 💩 will have the same string in its .name property;
- The String.prototype.charCodeAt method in JS can be directly used to emulate Lua's string.byte (with one argument, after shifting offsets by 1), and likewise String.prototype.substr can be used similarly to Lua's string.sub;
- The .length property of decoded string values in the AST is equal to the value that the # operator would return in Lua.
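The second and third properties can be sketched as follows. luaByte and luaSub are hypothetical helper names, and value stands for a decoded string from the AST in which every code unit holds one byte. This version uses slice rather than substr, and luaByte only handles positive indices.

```javascript
// Hypothetical helpers operating on a one-byte-per-code-unit string.
function luaByte(value, i = 1) {
  // Lua indices are 1-based, charCodeAt is 0-based.
  const code = value.charCodeAt(i - 1);
  return Number.isNaN(code) ? undefined : code; // out of range: no result
}

function luaSub(value, i, j = -1) {
  const len = value.length; // equals what Lua's # operator would return
  // Negative indices count from the end of the string.
  if (i < 0) i = Math.max(len + i + 1, 1);
  else if (i === 0) i = 1;
  if (j < 0) j = len + j + 1;
  else if (j > len) j = len;
  return i > j ? '' : value.slice(i - 1, j);
}

const value = '\u00f0\u009f\u0092\u00a9'; // AST value for the Lua literal '💩'
const firstByte = luaByte(value, 1);      // 240
const middle = luaSub(value, 2, 3);       // bytes 159 and 146
```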

Maintaining those properties makes the logic of static analysers and code transformation tools simpler. However, it poses a problem when displaying strings to the user and when serialising the AST back into a string; to recover the original bytestrings, values transformed in this way will have to be encoded in ISO-8859-1.
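A minimal sketch of that recovery step, assuming an environment with a global TextDecoder (modern browsers and Node.js). The helper names toBytes and toDisplayString are invented for this example.

```javascript
// Hypothetical helper: encode a decoded AST string as ISO-8859-1,
// i.e. turn each code unit back into the byte it stands for.
function toBytes(value) {
  const bytes = new Uint8Array(value.length);
  for (let i = 0; i < value.length; i++) {
    const unit = value.charCodeAt(i);
    if (unit > 0xff) throw new RangeError('not a byte at offset ' + i);
    bytes[i] = unit;
  }
  return bytes;
}

// Hypothetical helper: for display, decode the recovered bytes as UTF-8.
// Ill-formed sequences are replaced with U+FFFD, so this step is lossy.
function toDisplayString(value) {
  return new TextDecoder('utf-8').decode(toBytes(value));
}
```

The bytes returned by toBytes are what should be written back to a Lua source file; toDisplayString is only suitable for showing the string to a person.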

Other solutions to this problem may be considered in the future. Some of them have been listed below, with their drawbacks:

0. A mode that instead treats the input as if it were decoded according to ISO-8859-1 (or the x-user-defined encoding) and rejects code points that cannot appear in that encoding; may be useful for source code in encodings other than UTF-8
   - Still tricky to get the semantics right
   - x-user-defined cannot take advantage of compact representation of ISO-8859-1 strings in certain JavaScript engines
1. Using an ArrayBuffer or Uint8Array for source code and/or string literals
   - May fail to be portable to older JavaScript engines
   - Cannot be (directly) serialised as JSON
   - Values of those types are fixed-length, which makes manipulation cumbersome; they cannot be incrementally built by appending
   - They cannot be used as keys in objects; one has to use Map and WeakMap instead
2. Using a plain Array of numbers in the range [0, 256)
   - May be memory-inefficient in naïve JavaScript engines
   - May bloat the JSON serialisation considerably
   - Cannot be used as keys in objects either
3. Storing string literal values as ordinary String values, and requiring that escape sequences in literals constitute well-formed UTF-8; an exception is thrown if they do not
   - UTF-8 chauvinism; imposes semantics that may be unwanted
   - Reduced compatibility with other Lua implementations
4. Like above, but instead of throwing an exception, ill-formed escapes are transformed to unpaired surrogates, just like Python's surrogateescape encoding error handler
   - UTF-8 chauvinism, though to a lesser extent
   - Destroys the property that ("\xc4" .. "\x99") == "\xc4\x99"
   - If the AST is encoded in JSON, some JSON libraries may refuse to parse it
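The concatenation drawback of the surrogateescape approach can be shown with a small sketch. It assumes a TextDecoder that supports the fatal option; decodeSurrogateEscape is an illustrative name, and the code approximates Python's handler rather than reproducing any existing implementation.

```javascript
// Sketch: decode UTF-8, turning every byte that is not part of a well-formed
// sequence into the unpaired surrogate U+DC00 + byte (U+DC80..U+DCFF).
function decodeSurrogateEscape(bytes) {
  const strict = new TextDecoder('utf-8', { fatal: true, ignoreBOM: true });
  let out = '';
  let i = 0;
  while (i < bytes.length) {
    const lead = bytes[i];
    // Sequence length implied by the lead byte; 0 means it cannot start one.
    const need = lead < 0x80 ? 1
      : lead >= 0xc2 && lead <= 0xdf ? 2
      : lead >= 0xe0 && lead <= 0xef ? 3
      : lead >= 0xf0 && lead <= 0xf4 ? 4 : 0;
    let decoded = null;
    if (need > 0 && i + need <= bytes.length) {
      try {
        decoded = strict.decode(bytes.subarray(i, i + need));
      } catch (e) {
        decoded = null; // bad continuation byte, overlong form, surrogate, ...
      }
    }
    if (decoded !== null) {
      out += decoded;
      i += need;
    } else {
      out += String.fromCharCode(0xdc00 + lead); // escape only this byte
      i += 1;
    }
  }
  return out;
}

// "\xc4\x99" decoded as a whole is the single character U+0119 ...
const whole = decodeSurrogateEscape(Uint8Array.of(0xc4, 0x99));
// ... but decoding "\xc4" and "\x99" separately gives two unpaired surrogates.
const parts = decodeSurrogateEscape(Uint8Array.of(0xc4)) +
  decodeSurrogateEscape(Uint8Array.of(0x99));
```

Here whole and parts differ, so concatenating two decoded literals no longer matches decoding the concatenated bytes.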

Cf. discussion under c05822d.


fstirlitz · 2019-08-21T16:31:59Z

I will probably add a switch to toggle between these modes:

- no interpretation for string literals at all; extended identifiers not mangled
- pseudo-ISO-8859-1/x-user-defined (option 0)
- UTF-8 (either current behaviour or option 3/4)

fstirlitz · 2020-01-16T01:09:03Z

Got some WIP code that implements an encodingMode option, which allows switching between:

- current behaviour
- no mangling for identifiers, .value of string literal nodes is null
- ISO-8859-1
- x-user-defined.
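For reference, the two byte-preserving encodings in this list differ only in where bytes 0x80..0xFF land. The sketch below follows the x-user-defined definition in the WHATWG Encoding Standard; the function names are invented for this example and say nothing about how the WIP code is structured.

```javascript
// ISO-8859-1: every byte maps to the code point with the same value.
function decodeLatin1(bytes) {
  return Array.from(bytes, (b) => String.fromCharCode(b)).join('');
}

// x-user-defined: ASCII bytes map to themselves,
// bytes 0x80..0xFF map to the private-use range U+F780..U+F7FF.
function decodeXUserDefined(bytes) {
  return Array.from(bytes, (b) =>
    String.fromCharCode(b < 0x80 ? b : 0xf780 + (b - 0x80))).join('');
}

// Reverse mapping; anything outside the two ranges cannot be represented
// and is rejected, as a mode based on this encoding would have to do.
function encodeXUserDefined(text) {
  const bytes = new Uint8Array(text.length);
  for (let i = 0; i < text.length; i++) {
    const unit = text.charCodeAt(i);
    if (unit < 0x80) bytes[i] = unit;
    else if (unit >= 0xf780 && unit <= 0xf7ff) bytes[i] = unit - 0xf780 + 0x80;
    else throw new RangeError('code unit not representable at offset ' + i);
  }
  return bytes;
}
```

Under x-user-defined, a source character such as é (U+00E9) has no byte assigned to it, so input containing it would be rejected, whereas ISO-8859-1 accepts it as the single byte 0xE9.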

A 'true UTF-8' mode (option 3 or 4) would be considerably more involved, and perhaps not worth it. Still considering it, though.

fstirlitz · 2020-02-09T18:50:08Z

I changed my mind; I won't keep current behaviour as the default, maybe I won't even keep it as an option; nobody seems to expect or desire it anyway. The default will be no mangling and no literal interpretation; this allows users to parse Unicode source code without hassle, while those interested in string literals can choose some other mode that ensures a coherent interpretation.

I implemented UTF-8 modes too, but they're a little hacky. I also still need to document the option.

fstirlitz · 2020-02-22T10:21:53Z

Finally committed as fstirlitz:2b04739...fstirlitz:10666c7.

Leaving out UTF-8 modes for the moment; I may add them later. I’m leaving this issue open until I make a decision, but either way it goes, it’s not a release blocker.

retorquere · 2021-04-26T13:09:18Z

I'd still be interested in UTF-8. I've tried reading up on x-user-defined but did not come away with an understanding of where it would break down. I am interested in literal strings as I want to use luaparse to change Lua source into JS.

fstirlitz added the "compat hazard" label (Resolving this issue may create backwards compatibility problems) Aug 17, 2019

fstirlitz added this to the 0.3 milestone Aug 17, 2019

fstirlitz mentioned this issue Nov 2, 2019: Unicode Issue #77 (Closed)

fstirlitz self-assigned this Jan 15, 2020

This was referenced Jan 16, 2020: Lua 5.4 support #61 (Open); Can not analyze symbols with Chinese #53 (Open)

fstirlitz mentioned this issue Apr 12, 2020: StringLiteral value is null at v0.3.0 #82 (Closed)

fstirlitz mentioned this issue Sep 3, 2020: Regression when parsing TableKey #86 (Closed)

fstirlitz mentioned this issue Apr 28, 2021: Add 'utf-8-lossy' encoding mode #100 (Draft)
