Skip to content
This repository has been archived by the owner on Dec 15, 2022. It is now read-only.

Fixes #56: Do not ignore unmatched high surrogates, check index is within bounds #57

Merged
merged 1 commit into from
Sep 20, 2016

Conversation

alexdima
Copy link
Contributor

@alexdima alexdima commented Aug 26, 2016

This changes the way high surrogates are interpreted:

  • before, they would be simply skipped if they would not be immediately followed by a low surrogate and the utf8->utf16 offset map would diverge from v8's conversion of utf16->utf8.
  • now, high surrogates are interpreted as code points if they are not immediately followed by a low surrogate. From my debugging the utf8 value of strings, this aligns with v8's handling of this invalid case
  • I also added bound checks in ConvertUtf8OffsetToUtf16 and ConvertUtf16OffsetToUtf8

@alexdima
Copy link
Contributor Author

alexdima commented Sep 6, 2016

@zcbenz @kevinsawicki friendly ping :)

This PR fixes a segmentation fault, it is sort of rare to run into invalid UTF-16, but it is possible. I can also explain in more detail what's going on if you think that'll help.

@alexdima
Copy link
Contributor Author

alexdima commented Sep 20, 2016

@zcbenz @kevinsawicki @50Wliu

We've just had this occur in the wild and not just from our internal testing https://twitter.com/matt_porter_/status/778019210806951936

I take full responsibility for introducing this bug when I've contributed the fast UTF16 - UTF8 offset conversion in the form of OnigString and I'm sorry for it.

I would really appreciate if you could merge this PR and publish a new version of this module to npm, otherwise my hand is forced to fork for good as our users are impacted by it.

@maxbrunsfeld
Copy link
Contributor

Sorry for the delay, @alexandrudima. Thanks for tracking this down.

@maxbrunsfeld maxbrunsfeld merged commit 62d0778 into atom:master Sep 20, 2016
@maxbrunsfeld
Copy link
Contributor

maxbrunsfeld commented Sep 20, 2016

@alexandrudima Though you added great tests for this case, I'm going to do a brief manual test of my own, then will publish v6.1.1 shortly.

@alexdima
Copy link
Contributor Author

@maxbrunsfeld ❤️ Many thanks! I would also appreciate your input on this.

The challenge is:

  • we get from JS the string in UTF16 form
  • we use oniguruma with UTF8 strings
  • we use v8's (or node's?) conversion from UTF16 to UTF8
  • we need a fast way to convert UTF16 offsets to UTF8 offsets and vice-versa
  • this code creates the mappings up front because in the course of tokenizing a string, there are many offset conversions needed, especially on long lines.
  • the challenge is to have the offsets 100% reflect the UTF16-UTF8 conversion algorithm
  • I believe v8's conversion code is at https://github.com/v8/v8/blob/d383430d932f0eb7d8e832feeb9b60f5666f31de/src/inspector/String16.cpp#L76 , but I am unsure if node supersedes that with a different implementation

Thank again!

@maxbrunsfeld
Copy link
Contributor

maxbrunsfeld commented Sep 20, 2016

@alexandrudima Sorry for weighing in late on this, but did you consider avoiding the conversion to/from UTF8 altogether, using oniguruma's native support for UTF16? It seems like you should be able to change OnigRegExp::OnigRegExp to use onig_new_deluxe, which allows a target encoding to be specified. Then, in OnigScanner::OnigScanner, rather than using v8::String::Utf8Value, you could use v8::String::Write to copy the UTF16 contents of the v8 string directly.

@maxbrunsfeld
Copy link
Contributor

In the meantime, v6.1.1 is out.

@alexdima
Copy link
Contributor Author

That would probably be a brilliant change!

I did not change how oniguruma is used in my original PR #46 , I just changed/refactored the conversion code, which used to be done with a while loop for each offset individual conversion, meanwhile now it is computed up front and cached in OnigString.

I don't have intimate knowledge about node's GCing rules and lacked confidence to pursue that change, but I think it would be the correct direction (in fact I'm unsure why this library didn't do that to begin with).

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants