Skip to content
This repository has been archived by the owner on Dec 15, 2022. It is now read-only.

Avoid UTF8 to UTF16 character offset conversions #63

Closed
wants to merge 10 commits into from

Conversation

maxbrunsfeld
Copy link
Contributor

@maxbrunsfeld maxbrunsfeld commented Apr 18, 2017

These conversions are cached, as introduced in #46, but we can actually avoid doing them altogether by avoiding transcoding to UTF8 in the first place. This allows us to drop a fair amount of code. More importantly, it allows good performance in the presence of non-ascii characters even without using the OnigString API.

I've added a few simple APIs to OnigString so that it can be dropped-in as a replacement for a regular string more easily:

  • .length
  • .toString()
  • .substring (this is useful for efficiently obtaining the text of a capture group without having to maintain a reference to the original string)

I've also converted the CoffeeScript code to plain JS while I was here, and gotten rid of grunt.

/cc @nathansobo @50Wliu @alexandrudima

@maxbrunsfeld maxbrunsfeld changed the title Avoid expensive UTF8 to UTF16 character offset conversions Avoid UTF8 to UTF16 character offset conversions Apr 19, 2017
String::Utf8Value utf8Value(sources->Get(i));
regExps[i] = shared_ptr<OnigRegExp>(new OnigRegExp(string(*utf8Value)));
Local<String> source = Local<String>::Cast(sources->Get(i));
regExps[i] = shared_ptr<OnigRegExp>(new OnigRegExp(OnigString(source)));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drop the type aliases in onig-searcher.h, and use a fully qualified reference to std::shared_ptr<T> as you've done elsewhere? This file probably never should have relied on those aliases in that header.

PS. Dropping those aliases will obviously require updating the type references in other files, most notably onig-searcher.cc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is my first PR on this repo, and I noticed we had a bunch of using statements in headers. I removed a couple of them in files that I touched, but I don't think I want to update the styling of the rest of the code at the moment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that makes perfect sense. Can always be done later-ish. Other that this one comment, I reckon this PR looks stellar ✨

Copy link
Contributor Author

@maxbrunsfeld maxbrunsfeld Apr 19, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for giving this a 👀 ! Always good to have more eyes on our c++ code.


// Characters after unmatched high surrogates are not found.
match = scanner.findNextMatchSync(`X${String.fromCharCode(0xd83c)}X`, 1)
expect(match).toBeNull()
Copy link
Contributor Author

@maxbrunsfeld maxbrunsfeld Apr 19, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the one behavior that's changed because we no longer transcode to UTF8. When dealing with invalid UTF16 (specifically, an unmatched high surrogate character), the unit that follows will no longer be considered as its own valid character, since it was supposed to contain the other half of the surrogate pair.

Previously, the second X would be matched, now there is no match, because oniguruma considers the second X to be part of the invalid sequence.

I think this is a fine behavior; mainly we just want to make sure that we do something consistent.

@nathansobo
Copy link
Contributor

Wow lots of great changes here. What's the continued role of OnigString?

@maxbrunsfeld
Copy link
Contributor Author

maxbrunsfeld commented Apr 19, 2017

Uh oh. It looks like several of Atom's grammars are relying on the fact that the input is encoded as UTF8; they use explicit UTF8 byte values in hexadecimal. 😢

Here's one example: atom/language-css#99. We could change language-css, but I grepped through the list of third-party packages and found several other usages of these byte literals.

I'm going to close this out for now and perform this more minimal change instead: #64.

@nathansobo
Copy link
Contributor

😿

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants