Incorrect handling of unpaired surrogates in JS strings #1348
Comments
(Another possible solution is to create a new Rust type which is internally just a
I think this means that JS strings aren't guaranteed to be valid UTF-16, right? It seems reasonable to me to add a type for this. Naming here may be a bit tricky though, since
Indeed, JS Strings and DOMStrings are not always valid UTF-16. This is actually a part of the spec. On the other hand, I believe USVString is valid UTF-16, and it's guaranteed to not contain unpaired surrogates. I think this has some implications for web-sys as well: anything which returns a JS String or DOMString should return I'm not sure what should be returned for USVString. It could probably return
This also has huge implications for function arguments: right now they tend to take However, it is common to want to call a JS function (which will return In that case the function cannot accept That's a pretty big breaking change for many methods in web-sys. And it also has a significant ergonomic cost, and it means that passing a I'm not really sure what the right option is here, they all seem bad.
Servo’s experience is likely to be valuable here, with @SimonSapin’s WTF-8 as a large part of its solution to the problem. I haven’t thought very much about the practicalities or APIs of it, but there could be a place for integrating carefully with Tendril, or if code size will be a problem there, some subset of it. Again, probably worth asking Simon for his opinion on this.

Something a little related to this matter that occurred to me this morning: one may want to perform serialisation/deserialisation of DOM strings with Serde (with serde-json or any other form), and one wants that to be as efficient as possible, so that working as close to the source representation as possible is desirable.
@chris-morgan Yes, being able to reuse WTF-8 (or in this case actually WTF-16) would be really useful. I'm not sure how Serde integration will work, I imagine the WTF-16 type will impl Serde, so I think it should Just Work(tm)?
(I think you misunderstood my Serde remark. What I mean is the string type of the likes of Well then, if we decide the Rust side needs to cope with unpaired surrogates, there are two possibilities:
But I don’t think these are the only two options that we have. Although I pointed out the problem and want it to be fixed in some way that won’t be playing whack-a-mole, I also declare that it would be a shame to damage APIs, performance and code size for everyone and everything, in order to satisfy a few corner cases, if we can instead determine satisfactory ways of working around those corner cases. Although you can encounter unmatched surrogates at any point in JS, I believe that the only situations in which you ever actually will are from key events, because characters in supplementary planes may come through in two parts. Nothing else actually gives you unmatched surrogates. More specifically still, then:
For the This then leaves the function:

```js
function should_ignore() {
  var tmp = this.data;
  return tmp && tmp.length == 1 && (tmp = tmp.charCodeAt(0)) >= 0xd800 && tmp < 0xdc00;
}
```

Then, simply issue a recommendation that anything that touches This way, once the dust settles, we’ll get the very occasional user, who is normally going to have been making a new framework, who needs to be told “please fix your input event handler” because they wrote code that gets tripped up by this (and don’t expect that people will find this on their own). But we won’t have damaged ergonomics everywhere.

Exposing additional functionality in wasm-bindgen or js-sys or wherever to convert between a JavaScript Please let me keep using
It might be possible to use the Unicode Private Use Areas to losslessly encode the surrogate pairs. That would allow for WTF-8 to be represented internally as a Rust That would also allow for it to Deref to
Yes, that can be accomplished by using
I'm not convinced that's true, but maybe you are right and that the "bad" APIs are few and far between.
Of course that will continue to be possible regardless, I'm not sure how ergonomic it will be, though.
Please don't be. This is an important issue which would have eventually been found regardless.
I think this wouldn’t be lossless, since two different input strings (one containing a lone surrogate, and one containing a private use area code point) could potentially be converted to the same string. Also note this is about lone aka unpaired surrogates. UTF-16 / WTF-16 surrogate code unit pairs should absolutely be converted to a single non-BMP code point (which is 4 bytes in UTF-8).
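For concreteness, here is a small Rust sketch of that distinction: a proper surrogate pair decodes to a single non-BMP code point, while a lone surrogate is rejected by a strict UTF-16 decoder (or substituted with U+FFFD by a lossy one).

```rust
fn main() {
    // U+1F600 (😀) is encoded in UTF-16 as the surrogate pair 0xD83D 0xDE00;
    // a strict decoder turns the pair into one code point (4 bytes in UTF-8).
    let paired = [0xD83Du16, 0xDE00];
    assert_eq!(String::from_utf16(&paired).unwrap(), "\u{1F600}");

    // A lone high surrogate is ill-formed UTF-16: strict decoding fails,
    // and lossy decoding substitutes U+FFFD, the replacement character.
    let lone = [0xD83Du16];
    assert!(String::from_utf16(&lone).is_err());
    assert_eq!(String::from_utf16_lossy(&lone), "\u{FFFD}");
}
```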
It’s arguable whether that should be part of the docs or if it’s an internal implementation detail, but the memory representation of
(Emphasis added.) In the linked issue:
I know ~nothing about what this code does, but why is the string that was just read from the text input immediately written back to it?
I'm not Pauan, but I believe that code is implementing a React-style controlled component, where the DOM always reflects the current state as understood by the application.
Of course that is true, but my understanding is that "all bets are off" with the Private Use Area, so maybe it's reasonable to assume that it won't be used?
Yes, of course. This bug report is about unpaired surrogates.
That's really interesting. I assume it returns
Dominator is a DOM library. In that example it is doing two-way data binding (which is very common with DOM libraries). Two-way data binding means that when the So when the Then when the second Even if it didn't write back the value, this issue would still happen, because the first
Correct, where invalid characters mean unpaired surrogate byte sequences.
If not for this bug, this “write back” would be a no-op, wouldn’t it? If so, why do it at all?
Wouldn’t the second event overwrite the effects of the first?
Okay, great. Perhaps we could use a similar API to handle this bug.
Yes, I'm pretty sure it is a no-op (though I'm not 100% sure on that). The reason is mostly because the APIs are currently designed to never miss any changes, since stale state is really not good. I have some plans to add in an optimization for this case to avoid the write-back. But in any case it's irrelevant to this issue, and this bug will still happen even without the write-back.
Not if the app has side effects during the first event. For example, if the app took that string and sent it to a server, or put it into a file, or put it into a data structure (like a It just so happens that in the dominator example the side effect was writing back to the DOM. But this issue is really not specific to dominator.
Yes, such side effects of the first event could observe the intermediate string with a replacement character. But is this fundamentally different from intermediate events observing a string with letters that do not yet form a meaningful word because the user has not finished typing? Also, even with the conversion made lossless at the DOM ↔ wasm boundary (possibly by having a
I don't understand this question. This is what is happening:
This is undesirable, because it means the string is wrong: it contains characters that the user never typed. You could argue that this is a bug in the browser, and it should send a single event. I agree with that, but we cannot change the browsers, so we must work around this issue. This is completely unrelated to processing strings for words or anything like that, because in that case the string always contains the characters the user typed, and it is always valid Unicode (and so any mistakes would be logic errors in the word-splitting program). In addition, the solutions for them are completely different: word splitting would likely require some form of debouncing, whereas we must solve this encoding issue without debouncing.
So modulo some naming, I think given all this I'd propose a solution that looks like:
The intention here is that there's optionally a type that we can use in I think I'd probably prefer to also say we shouldn't change Ok and for naming, one obvious other name would be
I'm not sure I agree with that: since So if WTF-8 can faithfully represent everything (including private use areas), I think we should use it instead.
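To illustrate why WTF-8 can faithfully represent lone surrogates, here is a minimal sketch (not the wtf8 crate's actual API): WTF-8 encodes surrogate code points with the ordinary three-byte UTF-8 pattern, a byte sequence that strict UTF-8 forbids, so no valid UTF-8 string can collide with the encoding of a lone surrogate.

```rust
// Encode a code point in the surrogate range (U+D800..=U+DFFF) using the
// generalized three-byte UTF-8 pattern, as WTF-8 does. Strict UTF-8 rejects
// these byte sequences, which is exactly what makes WTF-8 a lossless superset.
fn encode_surrogate_wtf8(cp: u16) -> [u8; 3] {
    assert!((0xD800..=0xDFFF).contains(&cp));
    let cp = cp as u32;
    [
        0xE0 | ((cp >> 12) & 0x0F) as u8,
        0x80 | ((cp >> 6) & 0x3F) as u8,
        0x80 | (cp & 0x3F) as u8,
    ]
}

fn main() {
    // The lone high surrogate U+D83D becomes ED A0 BD in WTF-8...
    let bytes = encode_surrogate_wtf8(0xD83D);
    assert_eq!(bytes, [0xED, 0xA0, 0xBD]);
    // ...and that byte sequence is not valid UTF-8, so no roundtrip ambiguity.
    assert!(std::str::from_utf8(&bytes).is_err());
}
```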
That's fine, though I think it has implications for Gloo (which I'll discuss on that repo).
I think I guess you could argue that using As another alternative, we could simply add in wasm-bindgen marshalling for the already-existing wtf8 crate (which is internally a That sidesteps the entire naming choice, and also enables no-copying conversion to
It's true yeah that This feels very similar to the decision of how to represent @Pauan do you feel you've got a good handle on when we'd use
I consider @alexcrichton’s proposed plan to be reasonable. I don’t think WTF-8 is the way forwards with this; it doesn’t solve problems, merely shifts which side, Rust or JavaScript, has to handle the encoding. A One correction:
@alexcrichton I think right now we just need a way of detecting if a JS string is valid UTF-16 or not. I see a couple options:
Either solution works fine. With that basic building block, it's now possible for the This completely sidesteps the issue: we no longer need to define
@Pauan you're thinking that in these cases folks would take As to which of those options, I think it makes sense to leave
@alexcrichton Well, I see two general use cases:
I don't actually know of any use cases for 2, so that's why I think solving 1 is good enough. So we only need And if later we actually do need to preserve unpaired surrogates, we can reinvestigate Basically, after carefully thinking about this issue, my perspective has softened a lot, so I no longer see the benefit of trying to preserve unpaired surrogates.
Sounds like a solid plan to me! Would the addition in that case be:

```rust
impl JsValue {
    pub fn is_valid_utf16_string(&self) -> bool {
        // some wasm_bindgen intrinsic
    }
}
```
@alexcrichton Yes, exactly. And the intrinsic would be implemented like this on the JS side (it has to be on the JS side, sadly):

```js
function isValidUtf16String(str) {
  var i = 0;
  var len = str.length;
  while (i < len) {
    var char = str.charCodeAt(i);
    ++i;
    // Might be surrogate pair...
    if (char >= 0xD800) {
      // First half of surrogate pair
      if (char <= 0xDBFF) {
        // End of string
        if (i === len) {
          return false;
        } else {
          var next = str.charCodeAt(i);
          // No second half
          if (next < 0xDC00 || next > 0xDFFF) {
            return false;
          }
          // Valid pair: skip past the low surrogate so it
          // isn't misread as an unpaired second half
          ++i;
        }
      // Second half of surrogate pair
      } else if (char <= 0xDFFF) {
        return false;
      }
    }
  }
  return true;
}
```

(I loosely translated it from the It looks big, but it minifies quite well, and it should be extremely fast.
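For comparison, the same check can be sketched on the Rust side over raw UTF-16 code units (this is an illustration, not an actual wasm-bindgen API):

```rust
// Returns true if the code units form well-formed UTF-16: every high
// surrogate (0xD800..=0xDBFF) must be immediately followed by a low
// surrogate (0xDC00..=0xDFFF), and no low surrogate may stand alone.
fn is_valid_utf16(units: &[u16]) -> bool {
    let mut iter = units.iter().copied();
    while let Some(u) = iter.next() {
        match u {
            0xD800..=0xDBFF => match iter.next() {
                Some(0xDC00..=0xDFFF) => {} // well-formed pair, keep going
                _ => return false,          // lone high surrogate
            },
            0xDC00..=0xDFFF => return false, // lone low surrogate
            _ => {}
        }
    }
    true
}

fn main() {
    assert!(is_valid_utf16(&[0x0041, 0xD83D, 0xDE00])); // "A😀"
    assert!(!is_valid_utf16(&[0xD83D]));                // lone high surrogate
    assert!(!is_valid_utf16(&[0xDE00]));                // lone low surrogate
}
```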
If we're leaving That would be more consistent with the standard library, where
@lfairy I agree, but that will require a deprecation/breaking change cycle, so it's best to batch that together with a bunch of other breaking changes. |
@Pauan I think it depends on whether we want to rename or not as to whether we do it now, because if we do want to rename having a deprecation is actually pretty easy to handle at any time. I'm less certain though that we'd want to rename. It seems to me that it's far more common to use I think we probably want to just update documentation to indicate the hazard?
@alexcrichton I don't have strong opinions about it: I really like the idea of being consistent with the Rust stdlib, but breaking changes aren't fun. And as you say, any bindings that return
This commit aims to address rustwasm#1348 via a number of strategies:

* Documentation is updated to warn about UTF-16 vs UTF-8 problems between JS and Rust, notably documenting that `as_string` and handling of arguments is lossy when there are lone surrogates.
* A `JsString::is_valid_utf16` method was added to test whether `as_string` is lossless or not.

The intention is that most default behavior of `wasm-bindgen` will remain, but where necessary bindings will use `JsString` instead of `str`/`String` and will manually check for `is_valid_utf16` as necessary. It's also hypothesized that this is relatively rare and not too performance critical, so an optimized intrinsic for `is_valid_utf16` is not yet provided.

Closes rustwasm#1348
The original issue cites emoji input, not Chinese. Are you actually able to reproduce this with Chinese? (Perhaps by enabling HKSCS characters in Microsoft Bopomofo (in pre-Windows 10 "Advanced" settings); I have no idea how to type Hong Kong characters with a Mandarin pronunciation-based IME. Or more obviously, by enabling HKSCS characters in Microsoft Changjie (directly in the Windows 10-style settings).)

Which browsers does the original emoji issue reproduce in? Is there a minimal non-wasm JS-only test case that demonstrates the issue?

For this IME-triggered event issue, checking the whole string for lone surrogates, using WTF-8, or making UTF-16 to UTF-8 conversion report unpaired surrogates are all overkill: it's enough to check if the last code unit of the JS string is an unpaired surrogate and discard the event if so. Please don't make all DOM APIs use
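That narrower check could look like the following Rust sketch (`ends_with_lone_surrogate` is a hypothetical helper for illustration, not an existing API). The key observation is that a high surrogate in final position is necessarily unpaired, since its partner would have to follow it:

```rust
// A string's final code unit can only be part of a pair if it is a low
// surrogate preceded by a high one; a high surrogate (0xD800..=0xDBFF)
// in final position is therefore always unpaired.
fn ends_with_lone_surrogate(units: &[u16]) -> bool {
    matches!(units.last(), Some(u) if (0xD800..=0xDBFF).contains(u))
}

fn main() {
    // First `input` event for 😀: only the high surrogate has arrived,
    // so the event would be discarded.
    assert!(ends_with_lone_surrogate(&[0xD83D]));
    // Second event: the pair is complete, so the event is processed normally.
    assert!(!ends_with_lone_surrogate(&[0xD83D, 0xDE00]));
}
```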
I haven't tested it, though all Unicode characters above a certain point use surrogate pairs, and the issue is about surrogate pairs, so I would expect it to behave the same. Maybe it doesn't, though.
I'll work on creating a reduced test case!
I don't think that's true: the user can input text anywhere in the
We have no plans to use The plan is to make the
Okay, I created a reduced test case: http://paaru.pbworks.com/w/file/fetch/132853233/input-bug.html I tried to reproduce it with Chinese, but I wasn't able to. It turns out that Chinese characters are not encoded with surrogate pairs, since they're in the BMP. So it's only very rare historical Chinese characters which might be affected. And since (as far as I can tell) Windows doesn't have any way to input those characters, I wasn't able to reproduce that. However, I was able to reproduce it with emoji characters with the following browsers:
Interestingly, I could not reproduce it with these browsers (they do not have this bug):
P.S. I also verified that this bug happens if the user enters the emoji anywhere in the
Thank you. Confirmed with the Windows 10 touch keyboard and Firefox. However, the emoji picker integrated into the Microsoft Pinyin IME does not reproduce this problem, which suggests that the touch keyboard is trying to generate the text using a non-IME interface to the app.
I mentioned HKSCS in my previous comment, because some Hong Kong characters that may be rare but not necessarily historical are assigned to CJK Unified Ideographs Extension B. For example, Apple Cangjie, IBus (Ubuntu, etc.) Cangjie, and Android GBoard Cangjie allow 𥄫 to be entered by typing bunhe (assuming QWERTY keycaps). Even after enabling HKSCS for Microsoft ChangJie (sic), I'm unable to enter the character using Microsoft's implementation. This is on Windows 10 1803. Another thing to test on a newer Windows 10 would be enabling the Adlam keyboard (Adlam is assigned to Plane 1) and seeing how it behaves.
OK.
I should mention that none of these appear to expose an unpaired surrogate in Firefox.
Entering astral characters with Microsoft Bopomofo doesn't appear to expose unpaired surrogates. (To test: In IME properties under Output Settings in Character Set, set the radio button to Unicode and check all boxes. Then type e.g. 1i6, press down arrow, press right arrow and then use the arrow keys and enter to choose any of the red characters in the palette.)
Describe the Bug
It was brought to my attention in Pauan/rust-dominator#10 that JavaScript strings (and DOMString) allow for unpaired surrogates.
When using `TextEncoder`, it will convert those unpaired surrogates into U+FFFD (the replacement character). According to the Unicode spec, this is correct behavior.

The issue is that because the unpaired surrogates are replaced, this is lossy, and that lossiness can cause serious issues.
You can read the above dominator bug report for the nitty gritty details, but the summary is that with `<input>` fields (and probably other things), it will send two `input` events, one for each surrogate.

When the first event arrives, the surrogate is unpaired, so because the string is immediately sent to Rust, the unpaired surrogate is converted into the replacement character.
Then the second event arrives, and the surrogate is still unpaired (because the first half was replaced), so the second half also gets replaced with the replacement character.
This has a lot of very deep implications, including for international languages (e.g. Chinese).
I did quite a bit of reading, and unfortunately I think the only real solution here is to always use `JsString`, and not convert into Rust `String`, because that is inherently lossy. Or if a conversion is done, it needs to do some checks to make sure that there aren't any unpaired surrogates.