Rethink string representation #68

Open
fstirlitz opened this issue Aug 17, 2019 · 5 comments
Labels
compat hazard Resolving this issue may create backwards compatibility problems

@fstirlitz
Owner

fstirlitz commented Aug 17, 2019

(cribbed from README.md)

Unlike strings in JavaScript, Lua strings are not Unicode strings, but bytestrings (sequences of 8-bit values); likewise, implementations of Lua parse the source code as a sequence of octets. However, the input to this parser is a JavaScript string, i.e. a sequence of 16-bit code units (not necessarily well-formed UTF-16). This poses a problem of how those code units should be interpreted, particularly if they are outside the Basic Latin block ('ASCII').

Currently, this parser handles Unicode input by encoding it in WTF-8 and reinterpreting the resulting code units as Unicode code points. This applies to string literals and (if extendedIdentifiers is enabled) to identifiers as well. Lua byte escapes inside string literals are interpreted directly as code points, while Lua 5.3 \u{} escapes are likewise encoded into UTF-8 code units, which are then reinterpreted as code points. The effect is as if the parser input were interpreted as ISO-8859-1 while actually being encoded in UTF-8.
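The transformation described above can be sketched roughly as follows. This is an illustration only, not the parser's actual code; the function name `encodeWtf8AsCodePoints` is hypothetical.

```javascript
// Hypothetical helper: encode a JS string as WTF-8, then return a string in
// which every resulting byte is represented by the code point of equal value.
function encodeWtf8AsCodePoints(input) {
  let out = '';
  for (let i = 0; i < input.length; i++) {
    let cp = input.charCodeAt(i);
    // Combine a well-formed surrogate pair into one code point. Unpaired
    // surrogates are left alone and encoded as three bytes; that is what
    // distinguishes WTF-8 from strict UTF-8.
    if (cp >= 0xd800 && cp <= 0xdbff && i + 1 < input.length) {
      const low = input.charCodeAt(i + 1);
      if (low >= 0xdc00 && low <= 0xdfff) {
        cp = 0x10000 + ((cp - 0xd800) << 10) + (low - 0xdc00);
        i++;
      }
    }
    if (cp < 0x80) {
      out += String.fromCharCode(cp);
    } else if (cp < 0x800) {
      out += String.fromCharCode(0xc0 | (cp >> 6), 0x80 | (cp & 0x3f));
    } else if (cp < 0x10000) {
      out += String.fromCharCode(
        0xe0 | (cp >> 12),
        0x80 | ((cp >> 6) & 0x3f),
        0x80 | (cp & 0x3f)
      );
    } else {
      out += String.fromCharCode(
        0xf0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3f),
        0x80 | ((cp >> 6) & 0x3f),
        0x80 | (cp & 0x3f)
      );
    }
  }
  return out;
}
```

Every code unit of the result is below 0x100, which is why the output looks like UTF-8 text misread as ISO-8859-1.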

This ensures that no otherwise-valid input will be rejected due to encoding errors. Assuming the input was originally encoded in UTF-8 (which includes the case of only containing ASCII characters), it also preserves the following properties:

  • String literals (and identifiers, if extendedIdentifiers is enabled) will have the same representation in the AST if and only if they represent the same string in the source code: e.g. the Lua literals '💩', '\u{1f4a9}' and '\240\159\146\169' will all have "\u00f0\u009f\u0092\u00a9" in their .value property, and likewise local 💩 will have the same string in its .name property;
  • The String.prototype.charCodeAt method in JS can be directly used to emulate Lua's string.byte (with one argument, after shifting offsets by 1), and likewise String.prototype.substr can be used similarly to Lua's string.sub;
  • The .length property of decoded string values in the AST is equal to the value that the # operator would return in Lua.
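As a rough illustration of the properties listed above, the following sketch emulates a few Lua string operations on a decoded value. The helper names are hypothetical, only positive indices are handled, and `slice` is used in place of `substr` with an adjusted end offset.

```javascript
// The .value the list above gives for the Lua literal '\240\159\146\169'.
const value = '\u00f0\u009f\u0092\u00a9';

// Rough equivalent of Lua's string.byte(s, i) for positive i.
// Lua offsets are 1-based, hence the shift by one.
function luaByte(s, i = 1) {
  return s.charCodeAt(i - 1);
}

// Rough equivalent of Lua's string.sub(s, i, j) for positive i and j.
// Lua's end offset is inclusive; slice's end offset is exclusive.
function luaSub(s, i, j = s.length) {
  return s.slice(i - 1, j);
}

// Equivalent of Lua's # operator on a string.
function luaLen(s) {
  return s.length;
}
```

With these definitions, `luaByte(value, 1)` is 240 and `luaLen(value)` is 4, matching what Lua reports for the four-byte string.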

Maintaining those properties makes the logic of static analysers and code transformation tools simpler. However, it poses a problem when displaying strings to the user and when serialising the AST back into a string: to recover the original bytestrings, values transformed in this way have to be encoded in ISO-8859-1.
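A minimal sketch of that recovery step, assuming an environment that provides `TextDecoder` (modern browsers and recent Node.js versions). The names `toBytes` and `toDisplayString` are hypothetical.

```javascript
// Encode a decoded AST value in ISO-8859-1: each code unit becomes one byte.
function toBytes(value) {
  const bytes = new Uint8Array(value.length);
  for (let i = 0; i < value.length; i++) {
    const unit = value.charCodeAt(i);
    if (unit > 0xff) {
      throw new RangeError('code unit out of ISO-8859-1 range at index ' + i);
    }
    bytes[i] = unit;
  }
  return bytes;
}

// For display only: decode the recovered bytes as UTF-8. Ill-formed
// sequences turn into U+FFFD, so this direction is lossy.
function toDisplayString(value) {
  return new TextDecoder('utf-8', { ignoreBOM: true }).decode(toBytes(value));
}
```

The byte array is what a serialiser would write back out; the display string is only suitable for showing to a user.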

Other solutions to this problem may be considered in the future. Some of them have been listed below, with their drawbacks:

  1. A mode that instead treats the input as if it were decoded according to ISO-8859-1 (or the x-user-defined encoding) and rejects code points that cannot appear in that encoding; may be useful for source code in encodings other than UTF-8
    • Still tricky to get the semantics right
    • x-user-defined cannot take advantage of compact representation of ISO-8859-1 strings in certain JavaScript engines
  2. Using an ArrayBuffer or Uint8Array for source code and/or string literals
    • May fail to be portable to older JavaScript engines
    • Cannot be (directly) serialised as JSON
    • Values of those types are fixed-length, which makes manipulation cumbersome; they cannot be incrementally built by appending.
    • They cannot be used as keys in objects; one has to use Map and WeakMap instead
  3. Using a plain Array of numbers in the range [0, 256)
    • May be memory-inefficient in naïve JavaScript engines
    • May bloat the JSON serialisation considerably
    • Cannot be used as keys in objects either
  4. Storing string literal values as ordinary String values, and requiring that escape sequences in literals constitute well-formed UTF-8; an exception is thrown if they do not
    • UTF-8 chauvinism; imposes semantics that may be unwanted
    • Reduced compatibility with other Lua implementations
  5. Like above, but instead of throwing an exception, ill-formed escapes are transformed to unpaired surrogates, just like Python's surrogateescape encoding error handler
    • UTF-8 chauvinism, though to a lesser extent
    • Destroys the property that ("\xc4" .. "\x99") == "\xc4\x99"
    • If the AST is encoded in JSON, some JSON libraries may refuse to parse it
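To make the concatenation drawback of the last option concrete, here is a rough sketch of a surrogateescape-style decoder. It is an approximation for illustration, not Python's implementation and not proposed parser code; the function names are hypothetical and `TextDecoder` is assumed to be available.

```javascript
const strictUtf8 = new TextDecoder('utf-8', { fatal: true, ignoreBOM: true });

// Length of the UTF-8 sequence a lead byte announces, or 0 if the byte
// cannot start a sequence.
function sequenceLength(lead) {
  if (lead < 0x80) return 1;
  if (lead >= 0xc2 && lead <= 0xdf) return 2;
  if (lead >= 0xe0 && lead <= 0xef) return 3;
  if (lead >= 0xf0 && lead <= 0xf4) return 4;
  return 0;
}

// Decode bytes as UTF-8; every byte that is not part of a well-formed
// sequence becomes the unpaired surrogate U+DC00 + byte.
function decodeSurrogateEscape(bytes) {
  let out = '';
  let i = 0;
  while (i < bytes.length) {
    const n = sequenceLength(bytes[i]);
    if (n > 0 && i + n <= bytes.length) {
      try {
        out += strictUtf8.decode(Uint8Array.from(bytes.slice(i, i + n)));
        i += n;
        continue;
      } catch (e) {
        // Ill-formed sequence: fall through and escape the lead byte.
      }
    }
    out += String.fromCharCode(0xdc00 + bytes[i]);
    i += 1;
  }
  return out;
}

// "\xc4" and "\x99" decoded separately, then concatenated.
const pieces = decodeSurrogateEscape([0xc4]) + decodeSurrogateEscape([0x99]);
// "\xc4\x99" decoded in one go.
const whole = decodeSurrogateEscape([0xc4, 0x99]);
```

Here `pieces` consists of two unpaired surrogates while `whole` is the single character U+0119, so concatenating decoded values no longer corresponds to concatenating the underlying bytestrings.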

Cf. discussion under c05822d.

@fstirlitz fstirlitz added the compat hazard Resolving this issue may create backwards compatibility problems label Aug 17, 2019
@fstirlitz fstirlitz added this to the 0.3 milestone Aug 17, 2019
@fstirlitz
Owner Author

I will probably add a switch to toggle between these modes:

  • no interpretation for string literals at all; extended identifiers not mangled
  • pseudo-ISO-8859-1/x-user-defined (option 1)
  • UTF-8 (either current behaviour or option 4/5)

@fstirlitz fstirlitz mentioned this issue Nov 2, 2019
@fstirlitz fstirlitz self-assigned this Jan 15, 2020
@fstirlitz
Owner Author

Got some WIP code that implements an encodingMode option, which allows switching between:

  • current behaviour
  • no mangling for identifiers, .value of string literal nodes is null
  • ISO-8859-1
  • x-user-defined.
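For reference, the two byte-oriented modes differ only in where bytes 0x80 to 0xFF land. The sketch below follows the x-user-defined mapping from the WHATWG Encoding Standard; the function names are hypothetical and this is not the WIP code itself.

```javascript
// ISO-8859-1: byte N maps to code point U+00NN.
function decodeLatin1(bytes) {
  return Array.from(bytes, (b) => String.fromCharCode(b)).join('');
}

// x-user-defined: bytes below 0x80 map to themselves; bytes 0x80..0xFF map
// to the private-use range U+F780..U+F7FF.
function decodeXUserDefined(bytes) {
  return Array.from(bytes, (b) =>
    String.fromCharCode(b < 0x80 ? b : 0xf780 + (b - 0x80))
  ).join('');
}

// Inverse of decodeXUserDefined; rejects code units outside the mapping.
function encodeXUserDefined(str) {
  return Uint8Array.from(str, (ch) => {
    const unit = ch.charCodeAt(0);
    if (unit < 0x80) return unit;
    if (unit >= 0xf780 && unit <= 0xf7ff) return unit - 0xf780 + 0x80;
    throw new RangeError('not representable in x-user-defined');
  });
}
```

Both mappings keep one code unit per byte, so offsets and lengths still line up with the Lua bytestring in either mode.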

A 'true UTF-8' mode (option 4 or 5) would be considerably more involved, and perhaps not worth it. Still considering it, though.

This was referenced Jan 16, 2020
@fstirlitz
Owner Author

fstirlitz commented Feb 9, 2020

I changed my mind; I won't keep current behaviour as the default, maybe I won't even keep it as an option; nobody seems to expect or desire it anyway. The default will be no mangling and no literal interpretation; this allows users to parse Unicode source code without hassle, while those interested in string literals can choose some other mode that ensures a coherent interpretation.

I implemented UTF-8 modes too, but they're a little hacky. I also still need to document the option.

@fstirlitz
Owner Author

fstirlitz commented Feb 22, 2020

Finally committed as fstirlitz:2b04739...fstirlitz:10666c7.

Leaving out UTF-8 modes for the moment; I may add them later. I’m leaving this issue open until I make a decision, but either way it goes, it’s not a release blocker.

@retorquere

I'd still be interested in UTF-8. I've tried reading up on x-user-defined but did not come away with an understanding of where it would break down. I am interested in literal strings because I want to use luaparse to convert Lua source into JS.
