Full UTF-16 support #736

joshwd36 · 2020-09-29T18:37:12Z

Something I noticed when working on the string iterator (#704) is that currently strings are stored as ordinary Rust strings, which use UTF-8. However the javascript standard specifies that strings should use UTF-16. There are a few places where this difference is noticable. For example, the string 🙂 should have length 2 as it is made up of two code units, but currently shows as having length 1. Similarly, "🙂".charAt(0) should return some representation of the first code unit, which has a value of \ud83d and cannot be represented as a normal string.

There are a few ways this could be implemented:

Continue storing strings as Rust strings, and using str::encode_utf16(). This has the downside that certain operations, such as getting the length, now have to iterate through the whole string. It also complicates storing individual codepoints, such as "🙂".charAt(0).
Use the widestring crate or similar. Strings would therefore be stored as U16String, and only converted to rust strings on display.
Just storing arrays of u16s. This probably wouldn't be a good idea as we'd probably end up reimplementing a lot of the functionality of the widestring crate.

The text was updated successfully, but these errors were encountered:

joshwd36 · 2021-01-18T23:31:38Z

I had a bit of a go at this and it's going to be quite challenging, especially with performance. There are a number of things that are only implemented for str or String, such as parsing, regex, and son, all of which would need costly conversions between UTF-8 and UTF-16. This would suggest it might be best to just use Rust strings, but then we have the issue of storing invalid strings, which JavaScript supports.

jasonwilliams · 2021-01-19T00:06:33Z

We might need to research how other engines deal with this issue @joshwd36, thanks for looking into it. What was he outcome widestring crate?

you may find https://blog.mozilla.org/javascript/2014/07/21/slimmer-and-faster-javascript-strings-in-firefox/ interesting

joshwd36 · 2021-01-19T00:17:28Z

What was he outcome widestring crate?

I used a local fork of it that I'd modified to give greater parity with String to do the investigation, replacing the inner value of RcString with it, and had a look at what errors there were. The main issue is that there is no way to have an as_str() method, so every time it uses a function that takes &str as an argument it forces an allocation to convert to a String. The only way I can see of mitigating that is using/rewriting libraries to use UTF-16 strings, which probably isn't feasible.

you may find https://blog.mozilla.org/javascript/2014/07/21/slimmer-and-faster-javascript-strings-in-firefox/ interesting

That seems to suggest that Firefox does use an internal UTF-16 representation, with the exception of the optimisation they're describing, although it also suggests they have a custom regex engine which presumably also operates on UTF-16 strings.

jedel1043 · 2021-10-07T18:10:04Z

https://github.com/rylev/const-utf16 looks promising to convert from Rust's string literals to UTF-16 literals on const and static contexts.

jasonwilliams · 2021-11-02T15:01:03Z

For example, the string 🙂 should have length 2 as it is made up of two code units, but currently shows as having length 1

@joshwd36 or @jedel1043 are you able to explain a little more why this is happening? I would have expected the UTF-8 version to have a higher length but my knowledge in this area isn't great

jedel1043 · 2021-11-02T15:32:53Z

For example, the string 🙂 should have length 2 as it is made up of two code units, but currently shows as having length 1

@joshwd36 or @jedel1043 are you able to explain a little more why this is happening? I would have expected the UTF-8 version to have a higher length but my knowledge in this area isn't great

It was an old bug we had, but @joshwd36 fixed it here:

87d9e9c#diff-796dedc2c80b4163e38e66d39288c24707abd5e32ff4151e32a561bf2b0488b7R959

Essentially, it was because UTF-8 considers any of its 8 bit, 16 bit, 24 bit or 32 bit variable code points as a whole "Unicode Scalar Value", and "🙂" can be represented in utf-8 with a single 16-bit scalar value (F0 9F 99 82), hence a length of 1. However, Javascript considers the length of a string as the number of code units within the string, and "🙂" needs two 16-bit code units to be encoded in UTF-16, hence a length of 2.

@raskad

…rner` (#2147) So, @raskad and myself had a short discussion about the state of #736, and we came to the conclusion that it would be a good time to implement our own string interner; partly because the `string-interner` crate is a bit unmaintained (as shown by Robbepop/string-interner#42 and Robbepop/string-interner#47), and partly because it would be hard to experiment with custom optimizations for UTF-16 strings. I still want to thank @Robbepop for the original implementation though, because some parts of this design have been shamelessly stolen from it 😅. Having said that, this PR is a complete reimplementation of the interner, but with some modifications to (hopefully!) make it a bit easier to experiment with UTF-16 strings, apply optimizations, and whatnot :)

jasonwilliams · 2022-10-03T11:49:44Z

@joshwd36 not sure if you saw but there’s now a PR for this #1659

I think it's time to address the elephant in the room. This Pull Request will (hopefully!) solve part of #736. This is a complete rewrite of `JsString`, but instead of storing `u8` bytes it stores `u16` words. The `encode!` macro (renamed to `utf16!` for simplicity) from the `const-utf16` crate allows us to create UTF-16 encoded arrays at compilation time. `JsString` implements `Deref<Target=[u16]>` to unlock the slice methods and possibly make some manipulations easier. However, we would need to create our own library of utilities for `JsString`.

Razican · 2022-10-21T08:42:17Z

This was closed in #1659

jevancc mentioned this issue Oct 9, 2020

Panic on invalid utf16 instead of an Error #778

Closed

devinus mentioned this issue Nov 19, 2020

Implement String.prototype.codePointAt #935

Merged

Razican added builtins PRs and Issues related to builtins/intrinsics enhancement New feature or request good first issue Good for newcomers labels Jan 11, 2021

Razican mentioned this issue Jan 11, 2021

String interning #279

Closed

raskad mentioned this issue Aug 9, 2021

Implement missing RegExp properties #1304

Closed

8 tasks

joshwd36 mentioned this issue Oct 3, 2021

Implement URI Handling Functions #894

Closed

jtara1 mentioned this issue Oct 4, 2021

string mod function, code_point_at, does not return correct code point of lone high surrogate #1632

Closed

jedel1043 mentioned this issue Oct 11, 2021

[Merged by Bors] - First prototype for new JsString using UTF-16 #1659

Closed

jedel1043 linked a pull request Feb 27, 2022 that will close this issue

[Merged by Bors] - First prototype for new JsString using UTF-16 #1659

Closed

jedel1043 mentioned this issue Jun 28, 2022

[Merged by Bors] - Remove string-interner dependency and implement custom string Interner #2147

Closed

jedel1043 added this to Boa pre-v1 Aug 22, 2022

jedel1043 moved this to In Progress in Boa pre-v1 Aug 22, 2022

jedel1043 self-assigned this Aug 22, 2022

Razican closed this as completed Oct 21, 2022

Repository owner moved this from In Progress to Done in Boa pre-v1 Oct 21, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Full UTF-16 support #736

Full UTF-16 support #736

joshwd36 commented Sep 29, 2020

joshwd36 commented Jan 18, 2021

jasonwilliams commented Jan 19, 2021 •

edited

Loading

joshwd36 commented Jan 19, 2021

jedel1043 commented Oct 7, 2021

jasonwilliams commented Nov 2, 2021 •

edited

Loading

jedel1043 commented Nov 2, 2021

jasonwilliams commented Oct 3, 2022

Razican commented Oct 21, 2022

Full UTF-16 support #736

Full UTF-16 support #736

Comments

joshwd36 commented Sep 29, 2020

joshwd36 commented Jan 18, 2021

jasonwilliams commented Jan 19, 2021 • edited Loading

joshwd36 commented Jan 19, 2021

jedel1043 commented Oct 7, 2021

jasonwilliams commented Nov 2, 2021 • edited Loading

jedel1043 commented Nov 2, 2021

jasonwilliams commented Oct 3, 2022

Razican commented Oct 21, 2022

jasonwilliams commented Jan 19, 2021 •

edited

Loading

jasonwilliams commented Nov 2, 2021 •

edited

Loading