-
Notifications
You must be signed in to change notification settings - Fork 46
Avoid UTF8 to UTF16 character offset conversions #63
Conversation
Just use oniguruma's native UTF16 support.
String::Utf8Value utf8Value(sources->Get(i)); | ||
regExps[i] = shared_ptr<OnigRegExp>(new OnigRegExp(string(*utf8Value))); | ||
Local<String> source = Local<String>::Cast(sources->Get(i)); | ||
regExps[i] = shared_ptr<OnigRegExp>(new OnigRegExp(OnigString(source))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Drop the type aliases in onig-searcher.h
, and use a fully qualified reference to std::shared_ptr<T>
as you've done elsewhere? This file probably never should have relied on those aliases in that header.
PS. Dropping those aliases will obviously require updating the type references in other files, most notably onig-searcher.cc
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this is my first PR on this repo, and I noticed we had a bunch of using
statements in headers. I removed a couple of them in files that I touched, but I don't think I want to update the styling of the rest of the code at the moment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that makes perfect sense. Can always be done later-ish. Other that this one comment, I reckon this PR looks stellar ✨
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for giving this a 👀 ! Always good to have more eyes on our c++ code.
|
||
// Characters after unmatched high surrogates are not found. | ||
match = scanner.findNextMatchSync(`X${String.fromCharCode(0xd83c)}X`, 1) | ||
expect(match).toBeNull() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the one behavior that's changed because we no longer transcode to UTF8. When dealing with invalid UTF16 (specifically, an unmatched high surrogate character), the unit that follows will no longer be considered as its own valid character, since it was supposed to contain the other half of the surrogate pair.
Previously, the second X
would be matched, now there is no match, because oniguruma considers the second X
to be part of the invalid sequence.
I think this is a fine behavior; mainly we just want to make sure that we do something consistent.
Wow lots of great changes here. What's the continued role of |
Uh oh. It looks like several of Atom's grammars are relying on the fact that the input is encoded as UTF8; they use explicit UTF8 byte values in hexadecimal. 😢 Here's one example: atom/language-css#99. We could change I'm going to close this out for now and perform this more minimal change instead: #64. |
😿 |
These conversions are cached, as introduced in #46, but we can actually avoid doing them altogether by avoiding transcoding to UTF8 in the first place. This allows us to drop a fair amount of code. More importantly, it allows good performance in the presence of non-ascii characters even without using the
OnigString
API.I've added a few simple APIs to
OnigString
so that it can be dropped-in as a replacement for a regular string more easily:.length
.toString()
.substring
(this is useful for efficiently obtaining the text of a capture group without having to maintain a reference to the original string)I've also converted the CoffeeScript code to plain JS while I was here, and gotten rid of
grunt
./cc @nathansobo @50Wliu @alexandrudima