Add warnings about UTF-16 vs UTF-8 strings
This commit aims to address #1348 via a number of strategies:

* Documentation is updated to warn about UTF-16 vs UTF-8 problems
  between JS and Rust. Notably, it documents that `as_string` and the
  handling of string arguments are lossy when there are lone surrogates.

* A `JsString::is_valid_utf16` method was added to test whether
  `as_string` is lossless or not.

The intention is that most default behavior of `wasm-bindgen` will
remain, but where necessary bindings will use `JsString` instead of
`str`/`String` and will manually check for `is_valid_utf16` as
necessary. It's also hypothesized that this is relatively rare and not
too performance critical, so an optimized intrinsic for `is_valid_utf16`
is not yet provided.

Closes #1348
alexcrichton committed Apr 4, 2019
1 parent c5f18b6 commit 2452d25
Showing 6 changed files with 89 additions and 1 deletion.
31 changes: 31 additions & 0 deletions crates/js-sys/src/lib.rs
@@ -3522,6 +3522,37 @@ impl JsString {
None
}
}

/// Returns whether this string is a valid UTF-16 string.
///
/// This is useful for learning whether `String::from(..)` will return a
/// lossless representation of the JS string. If this string contains
/// unpaired surrogates then `String::from` will succeed but it will be a
/// lossy representation of the JS string because unpaired surrogates will
/// become replacement characters.
///
/// If this function returns `false` then to get a lossless representation
/// of the string you'll need to manually use the `iter` method (or the
/// `char_code_at` accessor) to view the raw UTF-16 code units.
///
/// For more information, see the documentation on [JS strings vs Rust
/// strings][docs].
///
/// [docs]: https://rustwasm.github.io/docs/wasm-bindgen/reference/types/str.html
pub fn is_valid_utf16(&self) -> bool {
std::char::decode_utf16(self.iter()).all(|i| i.is_ok())
}

/// Returns an iterator over the `u16` character codes that make up this JS
/// string.
///
/// This method will call `char_code_at` for each code in this JS string,
/// returning an iterator of the codes in sequence.
pub fn iter<'a>(
&'a self,
) -> impl ExactSizeIterator<Item = u16> + DoubleEndedIterator<Item = u16> + 'a {
(0..self.length()).map(move |i| self.char_code_at(i) as u16)
}
}

impl PartialEq<str> for JsString {
12 changes: 12 additions & 0 deletions crates/js-sys/tests/wasm/JsString.rs
@@ -541,3 +541,15 @@ fn raw() {
);
assert!(JsString::raw_0(&JsValue::null().unchecked_into()).is_err());
}

#[wasm_bindgen_test]
fn is_valid_utf16() {
assert!(JsString::from("a").is_valid_utf16());
assert!(JsString::from("").is_valid_utf16());
assert!(JsString::from("🥑").is_valid_utf16());
assert!(JsString::from("Why hello there this, 🥑, is 🥑 and is 🥑").is_valid_utf16());

assert!(JsString::from_char_code1(0x00).is_valid_utf16());
assert!(!JsString::from_char_code1(0xd800).is_valid_utf16());
assert!(!JsString::from_char_code1(0xdc00).is_valid_utf16());
}
7 changes: 6 additions & 1 deletion examples/without-a-bundler/index.html
@@ -25,7 +25,12 @@
// Also note that the promise, when resolved, yields the wasm module's
// exports which is the same as importing the `*_bg` module in other
// modes
await init('./pkg/without_a_bundler_bg.wasm');
// await init('./pkg/without_a_bundler_bg.wasm');

const url = await fetch('http://localhost:8001/pkg/without_a_bundler_bg.wasm');
const body = await url.arrayBuffer();
const module = await WebAssembly.compile(body);
await init(module);

// And afterwards we can use all the functionality defined in wasm.
const result = add(1, 2);
27 changes: 27 additions & 0 deletions guide/src/reference/types/str.md
@@ -20,3 +20,30 @@ with handles to JavaScript string values, use the `js_sys::JsString` type.
```js
{{#include ../../../../examples/guide-supported-types-examples/str.js}}
```

## UTF-16 vs UTF-8

Strings in JavaScript are encoded as UTF-16, but with one major exception: they
can contain unpaired surrogates. For some Unicode characters UTF-16 uses two
16-bit values. These are called "surrogate pairs" because they always come in
pairs. In JavaScript, it is possible for a surrogate to be missing its other
half, creating an "unpaired surrogate".
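
As a rough illustration of the terminology above, the following sketch uses only the Rust standard library and does not touch `wasm-bindgen`. The helper names `utf16_units` and `is_well_formed` are made up for this example, and a `&[u16]` slice stands in for the code units of a JavaScript string.

```rust
/// Encodes a Rust string as UTF-16 code units, the representation JS uses.
/// (Hypothetical helper, for illustration only.)
fn utf16_units(s: &str) -> Vec<u16> {
    s.encode_utf16().collect()
}

/// Returns `true` when every surrogate in `units` is part of a proper pair.
/// (Hypothetical helper; it mirrors the idea behind `JsString::is_valid_utf16`.)
fn is_well_formed(units: &[u16]) -> bool {
    std::char::decode_utf16(units.iter().copied()).all(|unit| unit.is_ok())
}
```

Under these assumptions, `utf16_units("🥑")` yields the surrogate pair `[0xD83E, 0xDD51]`, which is well formed, while either half on its own (for example `[0xD83E]`) is an unpaired surrogate.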

When passing a string from JavaScript to Rust, `wasm-bindgen` uses the
`TextEncoder` API to convert from UTF-16 to UTF-8. This is normally perfectly
fine, unless there are unpaired surrogates. In that case `TextEncoder` replaces
each unpaired surrogate with U+FFFD (�, the replacement character). That means
the string in Rust is now different from the string in JavaScript!
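
The effect of that replacement can be approximated in plain Rust. This sketch assumes that `String::from_utf16_lossy` from the standard library treats unpaired surrogates the same way the `TextEncoder` step does (both substitute U+FFFD). It is not the code path `wasm-bindgen` actually runs, and `to_rust_lossy` is a hypothetical name.

```rust
/// Converts UTF-16 code units to a Rust `String`, replacing every unpaired
/// surrogate with U+FFFD. (Hypothetical stand-in for the JS-to-Rust copy.)
fn to_rust_lossy(js_units: &[u16]) -> String {
    String::from_utf16_lossy(js_units)
}
```

With this helper, the two different inputs `[0xD800]` and `[0xDC00]` both become the single character U+FFFD, so the original JavaScript string can no longer be recovered from the Rust side.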

If you want to guarantee that the Rust string is the same as the JavaScript
string, you should instead use `js_sys::JsString` (which keeps the string in
JavaScript and doesn't copy it into Rust).

If you want to access the raw value of a JS string, you can use `JsString::iter`,
which returns an `Iterator<Item = u16>` over the string's UTF-16 code units. This
perfectly preserves everything (including unpaired surrogates), but it does not
convert the string to UTF-8 (so you have to do any encoding yourself!).
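
One possible way to handle the raw code units is sketched below, again with only the standard library. The `Decoded` enum and `decode_units` function are hypothetical names chosen for this example: the function attempts a strict conversion with `String::from_utf16` and, if that fails, keeps the untouched code units instead of silently replacing anything.

```rust
/// Result of a lossless decode attempt. (Hypothetical type for illustration.)
#[derive(Debug, PartialEq)]
enum Decoded {
    /// The code units were valid UTF-16 and converted without loss.
    Text(String),
    /// The code units contained an unpaired surrogate; they are kept as-is.
    Raw(Vec<u16>),
}

/// Collects UTF-16 code units and converts them only if no data would be lost.
fn decode_units<I: IntoIterator<Item = u16>>(units: I) -> Decoded {
    let raw: Vec<u16> = units.into_iter().collect();
    match String::from_utf16(&raw) {
        Ok(text) => Decoded::Text(text),
        Err(_) => Decoded::Raw(raw),
    }
}
```

In a `wasm-bindgen` crate the argument would presumably be the iterator returned by `JsString::iter`; here any `u16` iterator works, which keeps the sketch independent of the JS runtime.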

If you simply want to ignore strings which contain unpaired surrogates, you can
use `JsString::is_valid_utf16` to test whether the string contains unpaired
surrogates or not.
3 changes: 3 additions & 0 deletions guide/src/reference/types/string.md
@@ -8,6 +8,9 @@ Copies the string's contents back and forth between the JavaScript
garbage-collected heap and the Wasm linear memory with `TextDecoder` and
`TextEncoder`

> **Note**: Be sure to check out the [documentation for `str`](str.html) to
> learn about some caveats when working with strings between JS and Rust.
## Example Rust Usage

```rust
10 changes: 10 additions & 0 deletions src/lib.rs
@@ -260,6 +260,16 @@ impl JsValue {
///
/// If this JS value is not an instance of a string or if it's not valid
/// utf-8 then this returns `None`.
///
/// # UTF-16 vs UTF-8
///
/// JavaScript strings in general are encoded as UTF-16, but Rust strings
/// are encoded as UTF-8. This can cause the Rust string to look a bit
/// different than the JS string sometimes. For more details see the
/// [documentation about the `str` type][caveats] which contains a few
/// caveats about the encodings.
///
/// [caveats]: https://rustwasm.github.io/docs/wasm-bindgen/reference/types/str.html
#[cfg(feature = "std")]
pub fn as_string(&self) -> Option<String> {
unsafe {