Full UTF-16 support #736
Comments
I had a bit of a go at this and it's going to be quite challenging, especially with performance. There are a number of things that are only implemented for
We might need to research how other engines deal with this issue. @joshwd36, thanks for looking into it. What was the outcome with the widestring crate? You may find https://blog.mozilla.org/javascript/2014/07/21/slimmer-and-faster-javascript-strings-in-firefox/ interesting.
I used a local fork of it that I'd modified to give greater parity with
That seems to suggest that Firefox does use an internal UTF-16 representation, with the exception of the optimisation they're describing, although it also suggests they have a custom regex engine which presumably also operates on UTF-16 strings.
https://github.com/rylev/const-utf16 looks promising for converting Rust's string literals to UTF-16 literals in const and static contexts.
@joshwd36 or @jedel1043 are you able to explain a little more why this is happening? I would have expected the UTF-8 version to have a higher length but my knowledge in this area isn't great.
It was an old bug we had, but @joshwd36 fixed it here: 87d9e9c#diff-796dedc2c80b4163e38e66d39288c24707abd5e32ff4151e32a561bf2b0488b7R959 Essentially, it was because UTF-8 encodes each "Unicode Scalar Value" as a variable-length sequence of 8, 16, 24 or 32 bits and treats that whole sequence as one scalar value. "🙂" is a single scalar value, encoded in UTF-8 as one 32-bit sequence (F0 9F 99 82), hence a length of 1. However, JavaScript defines the length of a string as the number of 16-bit code units within the string, and "🙂" needs two 16-bit code units to be encoded in UTF-16, hence a length of 2.
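To make the three different "lengths" concrete, here is a minimal sketch using only the Rust standard library. The helper names `js_length` and `scalar_count` are made up for illustration and are not part of Boa.

```rust
/// Length as JavaScript defines it: the number of UTF-16 code units.
/// (Hypothetical helper name, for illustration only.)
fn js_length(s: &str) -> usize {
    s.encode_utf16().count()
}

/// Number of Unicode scalar values, i.e. what `str::chars` iterates over.
/// (Hypothetical helper name, for illustration only.)
fn scalar_count(s: &str) -> usize {
    s.chars().count()
}
```

For `"🙂"`, `str::len` reports 4 (UTF-8 bytes), `scalar_count` reports 1, and `js_length` reports 2, which is the value JavaScript expects.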
…rner` (#2147) So, @raskad and myself had a short discussion about the state of #736, and we came to the conclusion that it would be a good time to implement our own string interner; partly because the `string-interner` crate is a bit unmaintained (as shown by Robbepop/string-interner#42 and Robbepop/string-interner#47), and partly because it would be hard to experiment with custom optimizations for UTF-16 strings. I still want to thank @Robbepop for the original implementation though, because some parts of this design have been shamelessly stolen from it 😅. Having said that, this PR is a complete reimplementation of the interner, but with some modifications to (hopefully!) make it a bit easier to experiment with UTF-16 strings, apply optimizations, and whatnot :)
I think it's time to address the elephant in the room. This Pull Request will (hopefully!) solve part of #736. This is a complete rewrite of `JsString`, but instead of storing `u8` bytes it stores `u16` words. The `encode!` macro (renamed to `utf16!` for simplicity) from the `const-utf16` crate allows us to create UTF-16 encoded arrays at compilation time. `JsString` implements `Deref<Target=[u16]>` to unlock the slice methods and possibly make some manipulations easier. However, we would need to create our own library of utilities for `JsString`.
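As a rough illustration of the `Deref<Target=[u16]>` idea, the sketch below uses a hypothetical `Utf16String` type backed by a `Vec<u16>`. It is not Boa's actual `JsString`, whose internal layout isn't shown in this thread; it only shows how dereferencing to a slice makes the slice methods available.

```rust
use std::ops::Deref;

/// Hypothetical stand-in for a UTF-16 backed string; NOT Boa's `JsString`.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Utf16String {
    units: Vec<u16>,
}

impl From<&str> for Utf16String {
    /// Re-encodes a UTF-8 Rust string as UTF-16 code units.
    fn from(s: &str) -> Self {
        Self {
            units: s.encode_utf16().collect(),
        }
    }
}

impl Utf16String {
    /// Lossy conversion for display; unpaired surrogates become U+FFFD.
    fn to_std_string_lossy(&self) -> String {
        String::from_utf16_lossy(&self.units)
    }
}

impl Deref for Utf16String {
    type Target = [u16];

    /// Exposes the code units as a slice, so `len`, indexing, `iter`, etc. work.
    fn deref(&self) -> &[u16] {
        &self.units
    }
}
```

With this in place, `Utf16String::from("🙂").len()` goes through the slice and counts code units, so the length is O(1) and matches what JavaScript reports.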
This was closed in #1659
Something I noticed when working on the string iterator (#704) is that currently strings are stored as ordinary Rust strings, which use UTF-8. However, the JavaScript standard specifies that strings should use UTF-16. There are a few places where this difference is noticeable. For example, the string `🙂` should have length 2 as it is made up of two code units, but currently shows as having length 1. Similarly, `"🙂".charAt(0)` should return some representation of the first code unit, which has a value of `\ud83d` and cannot be represented as a normal string.
There are a few ways this could be implemented:
- Strings stay stored as UTF-8 and are converted with `str::encode_utf16()`. This has the downside that certain operations, such as getting the length, now have to iterate through the whole string. It also complicates storing individual code units, such as `"🙂".charAt(0)`.
- Strings are stored as `U16String`, and only converted to Rust strings on display.
- Strings are stored as plain `u16`s. This probably wouldn't be a good idea as we'd probably end up reimplementing a lot of the functionality of the widestring crate.
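A small sketch of the `charAt(0)` problem, assuming the first option (keep UTF-8 and convert with `str::encode_utf16()`). The helper `code_unit_at` is a hypothetical name, not an existing Boa function.

```rust
/// Hypothetical `charAt`-like lookup over UTF-16 code units.
/// Walks the string from the start, so it is O(n) in the index.
fn code_unit_at(s: &str, index: usize) -> Option<u16> {
    s.encode_utf16().nth(index)
}
```

For `"🙂"`, index 0 yields `0xD83D`, a lone high surrogate. `String::from_utf16(&[0xD83D])` returns an error and `char::from_u32(0xD83D)` returns `None`, which is why the result cannot be stored in an ordinary Rust `String` or `char` and needs a `u16`-based representation.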