Fixes #56: Do not ignore unmatched high surrogates, check index is within bounds #57
@zcbenz @kevinsawicki friendly ping :) This PR fixes a segmentation fault. Running into invalid UTF-16 is fairly rare, but it is possible. I can also explain in more detail what's going on if you think that'll help.
We've just had this occur in the wild, not just in our internal testing: https://twitter.com/matt_porter_/status/778019210806951936 I take full responsibility for introducing this bug when I contributed the fast UTF-16 to UTF-8 offset conversion in the form of … I would really appreciate it if you could merge this PR and publish a new version of this module to npm; otherwise I will be forced to fork for good, as our users are affected by it.
Sorry for the delay, @alexandrudima. Thanks for tracking this down.
@alexandrudima Though you added great tests for this case, I'm going to do a brief manual test of my own, then will publish
@maxbrunsfeld ❤️ Many thanks! I would also appreciate your input on this. The challenge is:
Thanks again!
@alexandrudima Sorry for weighing in late on this, but did you consider avoiding the conversion to/from UTF-8 altogether, using oniguruma's native support for UTF-16? It seems like you should be able to change
In the meantime,
That would probably be a brilliant change! I did not change how oniguruma is used in my original PR #46; I just refactored the conversion code. It used to run a while loop for each individual offset conversion, whereas now the mapping is computed up front and cached in OnigString. I don't have intimate knowledge of Node's GC rules and lacked the confidence to pursue that change, but I think it would be the correct direction (in fact, I'm unsure why this library didn't do that to begin with).
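To illustrate the "computed up front and cached" approach described above, here is a minimal sketch. It is not the module's actual code: `OffsetCache` and its method names are hypothetical stand-ins for the tables kept in OnigString, and it assumes an unmatched surrogate is counted as a 3-byte sequence. Both tables are built in a single pass, so each later conversion is an array lookup, with out-of-range offsets clamped to the end of the text.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the cached mapping; not the module's OnigString.
class OffsetCache {
 public:
  explicit OffsetCache(const std::u16string& text) {
    std::size_t i = 0;
    while (i < text.size()) {
      const char16_t unit = text[i];
      std::size_t units = 1;  // UTF-16 code units in this code point
      std::size_t bytes = 3;  // UTF-8 bytes; also used for unmatched surrogates (assumption)
      if (unit < 0x80) {
        bytes = 1;
      } else if (unit < 0x800) {
        bytes = 2;
      } else if (IsHigh(unit) && i + 1 < text.size() && IsLow(text[i + 1])) {
        units = 2;  // valid surrogate pair, checked within bounds
        bytes = 4;
      }
      // Every unit of the code point maps to its first byte, and vice versa.
      utf16_to_utf8_.insert(utf16_to_utf8_.end(), units, utf8_to_utf16_.size());
      utf8_to_utf16_.insert(utf8_to_utf16_.end(), bytes, i);
      i += units;
    }
    // One extra entry so the end-of-text offset is convertible too.
    utf16_to_utf8_.push_back(utf8_to_utf16_.size());
    utf8_to_utf16_.push_back(text.size());
    assert(utf16_to_utf8_.size() == text.size() + 1);
  }

  // Out-of-range offsets are clamped to the end instead of read past it.
  std::size_t Utf16ToUtf8(std::size_t offset) const {
    return utf16_to_utf8_[std::min(offset, utf16_to_utf8_.size() - 1)];
  }
  std::size_t Utf8ToUtf16(std::size_t offset) const {
    return utf8_to_utf16_[std::min(offset, utf8_to_utf16_.size() - 1)];
  }

 private:
  static bool IsHigh(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
  static bool IsLow(char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

  std::vector<std::size_t> utf16_to_utf8_;
  std::vector<std::size_t> utf8_to_utf16_;
};
```

The trade-off is memory for speed: the tables cost one entry per code unit and per byte, but repeated conversions against the same string no longer rescan it.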
This changes the way high surrogates are interpreted:
ConvertUtf8OffsetToUtf16 and ConvertUtf16OffsetToUtf8
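As a rough illustration of what "do not ignore unmatched high surrogates" and "check index is within bounds" can mean in practice, the sketch below decodes one code point from a UTF-16 string. `DecodeAt` and `Decoded` are hypothetical names, not part of this module, and the choice to return an unmatched surrogate as its own one-unit code point is an assumption about the intended behavior rather than a description of the PR's code.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical result type: the decoded value and how many code units it used.
struct Decoded {
  std::uint32_t code_point;
  std::size_t units;  // 0 means the index was out of bounds
};

// Hypothetical helper, for illustration only.
Decoded DecodeAt(const std::u16string& text, std::size_t index) {
  if (index >= text.size()) {
    return {0, 0};  // nothing to read; never touch memory past the end
  }
  const char16_t unit = text[index];
  const bool is_high = unit >= 0xD800 && unit <= 0xDBFF;
  // Look at the next unit only if it exists.
  if (is_high && index + 1 < text.size()) {
    const char16_t next = text[index + 1];
    if (next >= 0xDC00 && next <= 0xDFFF) {
      const std::uint32_t cp =
          0x10000 + ((unit - 0xD800) << 10) + (next - 0xDC00);
      return {cp, 2};  // valid surrogate pair
    }
  }
  // BMP character, or an unmatched surrogate: it still consumes one unit
  // (assumption: it is reported as-is rather than skipped).
  return {unit, 1};
}
```

Because an unmatched high surrogate still consumes exactly one unit, a loop that advances by `units` always makes progress and stays in step with the string's length.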