Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Full UTF-16 support #736

Closed
joshwd36 opened this issue Sep 29, 2020 · 8 comments
Closed

Full UTF-16 support #736

joshwd36 opened this issue Sep 29, 2020 · 8 comments
Assignees
Labels
builtins PRs and Issues related to builtins/intrinsics enhancement New feature or request good first issue Good for newcomers

Comments

@joshwd36
Copy link
Contributor

Something I noticed when working on the string iterator (#704) is that currently strings are stored as ordinary Rust strings, which use UTF-8. However the javascript standard specifies that strings should use UTF-16. There are a few places where this difference is noticable. For example, the string 🙂 should have length 2 as it is made up of two code units, but currently shows as having length 1. Similarly, "🙂".charAt(0) should return some representation of the first code unit, which has a value of \ud83d and cannot be represented as a normal string.

There are a few ways this could be implemented:

  • Continue storing strings as Rust strings, and using str::encode_utf16(). This has the downside that certain operations, such as getting the length, now have to iterate through the whole string. It also complicates storing individual codepoints, such as "🙂".charAt(0).
  • Use the widestring crate or similar. Strings would therefore be stored as U16String, and only converted to rust strings on display.
  • Just storing arrays of u16s. This probably wouldn't be a good idea as we'd probably end up reimplementing a lot of the functionality of the widestring crate.
@Razican Razican added builtins PRs and Issues related to builtins/intrinsics enhancement New feature or request good first issue Good for newcomers labels Jan 11, 2021
@joshwd36
Copy link
Contributor Author

I had a bit of a go at this and it's going to be quite challenging, especially with performance. There are a number of things that are only implemented for str or String, such as parsing, regex, and son, all of which would need costly conversions between UTF-8 and UTF-16. This would suggest it might be best to just use Rust strings, but then we have the issue of storing invalid strings, which JavaScript supports.

@jasonwilliams
Copy link
Member

jasonwilliams commented Jan 19, 2021

We might need to research how other engines deal with this issue @joshwd36, thanks for looking into it. What was he outcome widestring crate?

you may find https://blog.mozilla.org/javascript/2014/07/21/slimmer-and-faster-javascript-strings-in-firefox/ interesting

@joshwd36
Copy link
Contributor Author

What was he outcome widestring crate?

I used a local fork of it that I'd modified to give greater parity with String to do the investigation, replacing the inner value of RcString with it, and had a look at what errors there were. The main issue is that there is no way to have an as_str() method, so every time it uses a function that takes &str as an argument it forces an allocation to convert to a String. The only way I can see of mitigating that is using/rewriting libraries to use UTF-16 strings, which probably isn't feasible.

you may find https://blog.mozilla.org/javascript/2014/07/21/slimmer-and-faster-javascript-strings-in-firefox/ interesting

That seems to suggest that Firefox does use an internal UTF-16 representation, with the exception of the optimisation they're describing, although it also suggests they have a custom regex engine which presumably also operates on UTF-16 strings.

@jedel1043
Copy link
Member

https://github.com/rylev/const-utf16 looks promising to convert from Rust's string literals to UTF-16 literals on const and static contexts.

@jasonwilliams
Copy link
Member

jasonwilliams commented Nov 2, 2021

For example, the string 🙂 should have length 2 as it is made up of two code units, but currently shows as having length 1

@joshwd36 or @jedel1043 are you able to explain a little more why this is happening? I would have expected the UTF-8 version to have a higher length but my knowledge in this area isn't great

@jedel1043
Copy link
Member

For example, the string 🙂 should have length 2 as it is made up of two code units, but currently shows as having length 1

@joshwd36 or @jedel1043 are you able to explain a little more why this is happening? I would have expected the UTF-8 version to have a higher length but my knowledge in this area isn't great

It was an old bug we had, but @joshwd36 fixed it here:

87d9e9c#diff-796dedc2c80b4163e38e66d39288c24707abd5e32ff4151e32a561bf2b0488b7R959

Essentially, it was because UTF-8 considers any of its 8 bit, 16 bit, 24 bit or 32 bit variable code points as a whole "Unicode Scalar Value", and "🙂" can be represented in utf-8 with a single 16-bit scalar value (F0 9F 99 82), hence a length of 1. However, Javascript considers the length of a string as the number of code units within the string, and "🙂" needs two 16-bit code units to be encoded in UTF-16, hence a length of 2.

bors bot pushed a commit that referenced this issue Jun 30, 2022
…rner` (#2147)

So, @raskad and myself had a short discussion about the state of #736, and we came to the conclusion that it would be a good time to implement our own string interner; partly because the `string-interner` crate is a bit unmaintained (as shown by Robbepop/string-interner#42 and Robbepop/string-interner#47), and partly because it would be hard to experiment with custom optimizations for UTF-16 strings. I still want to thank @Robbepop for the original implementation though, because some parts of this design have been shamelessly stolen from it 😅.

Having said that, this PR is a complete reimplementation of the interner, but with some modifications to (hopefully!) make it a bit easier to experiment with UTF-16 strings, apply optimizations, and whatnot :)
@jedel1043 jedel1043 moved this to In Progress in Boa pre-v1 Aug 22, 2022
@jedel1043 jedel1043 self-assigned this Aug 22, 2022
@jasonwilliams
Copy link
Member

@joshwd36 not sure if you saw but there’s now a PR for this #1659

bors bot pushed a commit that referenced this issue Oct 11, 2022
I think it's time to address the elephant in the room.

This Pull Request will (hopefully!) solve part of #736.

This is a complete rewrite of `JsString`, but instead of storing `u8` bytes it stores `u16` words. The `encode!` macro (renamed to `utf16!` for simplicity) from the `const-utf16` crate allows us to create UTF-16 encoded arrays at compilation time. `JsString` implements `Deref<Target=[u16]>` to unlock the slice methods and possibly make some manipulations easier. However, we would need to create our own library of utilities for `JsString`.
@Razican
Copy link
Member

Razican commented Oct 21, 2022

This was closed in #1659

@Razican Razican closed this as completed Oct 21, 2022
Repository owner moved this from In Progress to Done in Boa pre-v1 Oct 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
builtins PRs and Issues related to builtins/intrinsics enhancement New feature or request good first issue Good for newcomers
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

4 participants