Fixes #56: Do not ignore unmatched high surrogates, check index is within bounds #57

alexdima · 2016-08-26T10:06:38Z

This changes the way high surrogates are interpreted:

Before, they were simply skipped if they were not immediately followed by a low surrogate, so the utf8->utf16 offset map would diverge from v8's conversion of utf16->utf8.
Now, high surrogates are interpreted as code points if they are not immediately followed by a low surrogate. From my debugging of the utf8 value of strings, this aligns with v8's handling of this invalid case.
I also added bounds checks in ConvertUtf8OffsetToUtf16 and ConvertUtf16OffsetToUtf8.
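
For readers following along, here is a minimal sketch of the behavior described above. It is not the code in this PR: the names `BuildUtf16ToUtf8Offsets` and `LookupUtf8OffsetChecked` are hypothetical, the three-byte width for an unmatched surrogate is an assumption based on the description above, and clamping out-of-range offsets is just one possible bounds-check policy.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not the module's API): offsets[i] is the UTF-8 byte
// offset at which UTF-16 code unit i starts. One extra trailing entry holds
// the total UTF-8 length.
std::vector<int> BuildUtf16ToUtf8Offsets(const std::vector<uint16_t>& units) {
  std::vector<int> offsets(units.size() + 1);
  int utf8Offset = 0;
  for (size_t i = 0; i < units.size(); ++i) {
    offsets[i] = utf8Offset;
    const uint16_t unit = units[i];
    const bool isHigh = unit >= 0xD800 && unit <= 0xDBFF;
    const bool nextIsLow = i + 1 < units.size() &&
                           units[i + 1] >= 0xDC00 && units[i + 1] <= 0xDFFF;
    if (isHigh && nextIsLow) {
      // Valid surrogate pair: one 4-byte sequence. In this sketch both code
      // units map to the start of that sequence.
      offsets[i + 1] = utf8Offset;
      utf8Offset += 4;
      ++i;  // the low surrogate has been consumed
    } else if (unit < 0x80) {
      utf8Offset += 1;
    } else if (unit < 0x800) {
      utf8Offset += 2;
    } else {
      // Covers the rest of the BMP and, by assumption, unmatched surrogates:
      // they are counted as a 3-byte code point instead of being skipped.
      utf8Offset += 3;
    }
  }
  offsets[units.size()] = utf8Offset;
  return offsets;
}

// Hypothetical bounds-checked lookup: out-of-range offsets are clamped rather
// than read past the end of the table.
int LookupUtf8OffsetChecked(const std::vector<int>& offsets, int utf16Offset) {
  if (utf16Offset < 0) return 0;
  if (static_cast<size_t>(utf16Offset) >= offsets.size()) return offsets.back();
  return offsets[utf16Offset];
}
```

With the input `{'a', 0xD83D, 'b'}` (an unmatched high surrogate), this sketch yields the table `0, 1, 4, 5`. Skipping the surrogate would give `0, 1, 1, 2` instead, which no longer lines up with a 5-byte UTF-8 buffer.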


alexdima · 2016-09-06T07:42:32Z

This PR fixes a segmentation fault. It is fairly rare to run into invalid UTF-16, but it is possible. I can also explain in more detail what's going on if you think that'll help.

alexdima · 2016-09-20T19:24:59Z

@zcbenz @kevinsawicki @50Wliu

We've just had this occur in the wild, not just in our internal testing: https://twitter.com/matt_porter_/status/778019210806951936

I take full responsibility for introducing this bug when I contributed the fast UTF16 - UTF8 offset conversion in the form of OnigString, and I'm sorry for it.

I would really appreciate it if you could merge this PR and publish a new version of this module to npm; otherwise my hand is forced to fork for good, as our users are impacted by it.

maxbrunsfeld · 2016-09-20T19:52:38Z

Sorry for the delay, @alexandrudima. Thanks for tracking this down.

maxbrunsfeld · 2016-09-20T19:55:58Z

@alexandrudima Though you added great tests for this case, I'm going to do a brief manual test of my own, then will publish v6.1.1 shortly.

alexdima · 2016-09-20T20:13:34Z

@maxbrunsfeld ❤️ Many thanks! I would also appreciate your input on this.

The challenge is:

we get from JS the string in UTF16 form
we use oniguruma with UTF8 strings
we use v8's (or node's?) conversion from UTF16 to UTF8
we need a fast way to convert UTF16 offsets to UTF8 offsets and vice-versa
this code creates the mappings up front because in the course of tokenizing a string, there are many offset conversions needed, especially on long lines.
the challenge is to have the offsets 100% reflect the UTF16-UTF8 conversion algorithm
I believe v8's conversion code is at https://github.com/v8/v8/blob/d383430d932f0eb7d8e832feeb9b60f5666f31de/src/inspector/String16.cpp#L76 , but I am unsure if node supersedes that with a different implementation
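
To make the "mappings up front" idea concrete, here is a rough sketch that builds both tables in one pass so that each later conversion is a plain array lookup. This is an illustration only: `OffsetMaps` and `BuildOffsetMaps` are hypothetical names, not what OnigString actually contains, and the byte widths (including 3 bytes for an unmatched surrogate) are assumed to match the UTF16 to UTF8 conversion being mirrored.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical container for the two cached tables.
struct OffsetMaps {
  std::vector<int> utf16ToUtf8;  // one entry per UTF-16 code unit, plus end
  std::vector<int> utf8ToUtf16;  // one entry per UTF-8 byte, plus end
};

static bool IsHighSurrogate(char16_t c) { return c >= 0xD800 && c <= 0xDBFF; }
static bool IsLowSurrogate(char16_t c) { return c >= 0xDC00 && c <= 0xDFFF; }

// Walks the UTF-16 text once and records, for every code unit and every
// UTF-8 byte it would produce, where its counterpart starts.
OffsetMaps BuildOffsetMaps(const std::u16string& text) {
  OffsetMaps maps;
  maps.utf16ToUtf8.reserve(text.size() + 1);
  size_t i = 0;
  while (i < text.size()) {
    const char16_t unit = text[i];
    const bool isPair = IsHighSurrogate(unit) && i + 1 < text.size() &&
                        IsLowSurrogate(text[i + 1]);
    const int unitCount = isPair ? 2 : 1;
    // Assumed widths: 4 bytes for a valid pair, otherwise 1, 2, or 3 bytes
    // (3 also covers an unmatched surrogate).
    const int byteCount = isPair ? 4 : (unit < 0x80 ? 1 : (unit < 0x800 ? 2 : 3));
    const int byteStart = static_cast<int>(maps.utf8ToUtf16.size());
    for (int u = 0; u < unitCount; ++u) maps.utf16ToUtf8.push_back(byteStart);
    for (int b = 0; b < byteCount; ++b) maps.utf8ToUtf16.push_back(static_cast<int>(i));
    i += unitCount;
  }
  // End-of-string sentinels so that "offset == length" converts cleanly.
  maps.utf16ToUtf8.push_back(static_cast<int>(maps.utf8ToUtf16.size()));
  maps.utf8ToUtf16.push_back(static_cast<int>(text.size()));
  return maps;
}
```

The trade-off is memory for speed: the tables cost one `int` per code unit and per byte, but a tokenizer that converts many offsets on a long line pays for the scan only once.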

Thanks again!

maxbrunsfeld · 2016-09-20T20:34:19Z

@alexandrudima Sorry for weighing in late on this, but did you consider avoiding the conversion to/from UTF8 altogether, using oniguruma's native support for UTF16? It seems like you should be able to change OnigRegExp::OnigRegExp to use onig_new_deluxe, which allows a target encoding to be specified. Then, in OnigScanner::OnigScanner, rather than using v8::String::Utf8Value, you could use v8::String::Write to copy the UTF16 contents of the v8 string directly.

maxbrunsfeld · 2016-09-20T20:39:16Z

In the meantime, v6.1.1 is out.

alexdima · 2016-09-20T21:43:59Z

That would probably be a brilliant change!

I did not change how oniguruma is used in my original PR #46; I just refactored the conversion code, which used to run a while loop for each individual offset conversion, whereas now the mapping is computed up front and cached in OnigString.

I don't have intimate knowledge about node's GCing rules and lacked confidence to pursue that change, but I think it would be the correct direction (in fact I'm unsure why this library didn't do that to begin with).

Fixes #56: Do not ignore unmatched high surrogates, check index is wi…

8c23095

…thin bounds

alexdima mentioned this pull request Aug 26, 2016

Crash in node-oniguruma when tokenizing this malformed string with the JS grammar microsoft/vscode#10945

Closed

50Wliu added the needs-review label Aug 26, 2016

alexdima mentioned this pull request Sep 20, 2016

Multi-byte entry crashes code microsoft/vscode#12329

Closed

maxbrunsfeld merged commit 62d0778 into atom:master Sep 20, 2016
