-
Notifications
You must be signed in to change notification settings - Fork 77
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Should not prepend gb18030 second and third when range decode fails upon fourth #110
Comments
Nevermind. This can be addressed by adding more state to the decoder. |
On third thought: Having two errors would then be unprecedented in another way, which even with the added state from the previous comment would mess up encoding_rs's mechanism for designating which bytes a given U+FFFD corresponds to (not currently made use of in Firefox). |
Hmm. On fourth thought, maybe the machinery I put in place for ISO-2022-JP could work here. Let's do what's best for the Web and I'll sort out the error reporting. |
@jungshik any thoughts with regards to ICU? For four byte sequences we're dealing with this:
Now if 4 is not an ASCII digit it sounds like @hsivonen is suggesting it's reasonable to unwind, U+FFFD 1, and reprocess from 2. If 4 is an ASCII digit, but there's no code point, the question is what to do then:
|
90 31 90 40 yields "�1怈" in Chrome/Firefox/Safari. So given that they do seem to consistently unwind there. So the main question here is what to do when the pointer lookup goes astray. |
Instead of always unwinding if there’s no code point when consuming the fourth byte, only unwind when the fourth byte is not an ASCII digit. This does mean that ASCII digits can be masked, but since ASCII digits are not used as delimiter in any format this is highly unlikely to be used in any attacks (and also matches existing implementations better). Fixes #110.
Instead of always unwinding if there’s no code point when consuming the fourth byte, only unwind when the fourth byte is not an ASCII digit. This does mean that ASCII digits can be masked, but since ASCII digits are not used as delimiter in any format this is highly unlikely to be used in any attacks (and also matches existing implementations better). Fixes #110.
Instead of always unwinding if there’s no code point when consuming the fourth byte, only unwind when the fourth byte is not an ASCII digit. This does mean that ASCII digits can be masked, but since ASCII digits are not used as delimiter in any format this is highly unlikely to be used in any attacks (and also matches existing implementations better). Fixes #110.
Instead of always unwinding if there’s no code point when consuming the fourth byte, only unwind when the fourth byte is not an ASCII digit. This does mean that ASCII digits can be masked, but since ASCII digits are not used as delimiter in any format this is highly unlikely to be used in any attacks (and also matches existing implementations better). Fixes #110.
Instead of always unwinding if there’s no code point when consuming the fourth byte, only unwind when the fourth byte is not an ASCII digit. This does mean that ASCII digits can be masked, but since ASCII digits are not used as delimiter in any format this is highly unlikely to be used in any attacks (and also matches existing implementations better). Fixes whatwg#110.
Instead of always unwinding if there’s no code point when consuming the fourth byte, only unwind when the fourth byte is not an ASCII digit. This does mean that ASCII digits can be masked, but since ASCII digits are not used as delimiter in any format this is highly unlikely to be used in any attacks (and also matches existing implementations better). Fixes whatwg#110.
Consider a run of byte pairs where each pair starts with a byte that's valid as first or third and the second byte is valid as second or fourth in a four-byte gb18030 sequence.
The first two pairs form a four-byte sequence that fails the gb18030 ranges lookup. The spec, as written, realigns the decoder such that the second pair of the first out-of-range four byte sequence is next treated as the first half of another four-byte sequence.
The test case linked to above decodes to four replacement characters in Firefox, Chromium and into four question marks in IE and Edge. (Didn't test Safari.) However, per spec, the run decodes into "�9¶ĭ¶�9".
One error causing the decoder to misalign such that subsequent non-error characters come into existence where there previously were none in browsers can't be good.
Suggested fix:
When processing the fourth byte in a sequence, first check it it is in the range 0x30 to 0x39 inclusive. If it isn't, prepend the second, third and fourth bytes in the sequence per current spec and return error.
Then set code point to the index gb18030 ranges code point. If code point is null, prepend only the fourth byte and return error.
This would deviate from the current browser behavior by unmasking the last ASCII byte in the sequence the way we always unmask failed ASCII in two-byte sequences. However, unlike in the current spec, the ASCII byte that's second in the sequence would get swallowed into a REPLACEMENT CHARACTER. This is fine, because we know that ASCII byte had non-ASCII bytes before and after.
(Trying to emit two errors with an ASCII character in between as one step would be a novel behavior for the spec and would totally mess up the assumptions I've made about error emission in implementation.)
The text was updated successfully, but these errors were encountered: