Avoid UTF8 to UTF16 character offset conversions #63

maxbrunsfeld · 2017-04-18T22:25:18Z

These conversions are cached, as introduced in #46, but we can actually avoid doing them altogether by avoiding transcoding to UTF8 in the first place. This allows us to drop a fair amount of code. More importantly, it allows good performance in the presence of non-ASCII characters even without using the OnigString API.
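
For readers unfamiliar with the offset problem, here is a minimal sketch of why UTF8 byte offsets and UTF16 code unit offsets diverge once the text contains non-ASCII characters. `utf8OffsetToUtf16Offset` is a hypothetical helper written for this illustration; it is not the conversion code this PR removes.

```javascript
// Illustration only: a hypothetical helper, not part of node-oniguruma.
// Converts a UTF8 byte offset into the equivalent UTF16 code unit offset
// by walking the string one code point at a time.
function utf8OffsetToUtf16Offset(text, byteOffset) {
  let bytes = 0
  let units = 0
  for (const char of text) {                  // iterates by code point
    if (bytes >= byteOffset) break
    bytes += Buffer.byteLength(char, 'utf8')  // 1 to 4 bytes
    units += char.length                      // 1 or 2 UTF16 code units
  }
  return units
}

const text = 'a\u{1F600}b'                    // 'a', an emoji, 'b'
const utf16Index = text.indexOf('b')                      // index in code units
const utf8Index = Buffer.from(text, 'utf8').indexOf('b')  // index in bytes
const converted = utf8OffsetToUtf16Offset(text, utf8Index)
```

The walk is linear in the length of the text, which is presumably why such conversions were worth caching; matching directly on UTF16 avoids needing them at all.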

I've added a few simple APIs to OnigString so that it can be dropped in as a replacement for a regular string more easily:

.length
.toString()
.substring (this is useful for efficiently obtaining the text of a capture group without having to maintain a reference to the original string)
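
As a rough sketch of what "drop-in" means here, the snippet below uses a pure-JS stand-in that exposes only the three members listed above. `FakeOnigString` and `captureText` are hypothetical names for this illustration; the real OnigString is backed by native code and its internals are not shown in this thread.

```javascript
// Hypothetical stand-in for illustration; NOT the native OnigString.
class FakeOnigString {
  constructor(content) {
    // Assumed guard, mirroring the "Guard against non-strings" commit title.
    if (typeof content !== 'string') {
      throw new TypeError('Argument must be a string')
    }
    this.content = content
  }

  get length() {
    return this.content.length
  }

  toString() {
    return this.content
  }

  substring(start, end) {
    return this.content.substring(start, end)
  }
}

// Hypothetical consumer: it only touches .substring, so it accepts either
// a plain string or the wrapper without holding the original string.
function captureText(text, start, end) {
  return text.substring(start, end)
}

const fromPlain = captureText('hello world', 6, 11)
const fromWrapped = captureText(new FakeOnigString('hello world'), 6, 11)
```

Because callers rely only on `.length`, `.toString()` and `.substring`, code written against plain strings should not need to branch on which kind of value it received.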

I've also converted the CoffeeScript code to plain JS while I was here, and gotten rid of grunt.

/cc @nathansobo @50Wliu @alexandrudima

Just use oniguruma's native UTF16 support.

thomasjo · 2017-04-19T06:51:57Z

src/onig-scanner.cc

-    String::Utf8Value utf8Value(sources->Get(i));
-    regExps[i] = shared_ptr<OnigRegExp>(new OnigRegExp(string(*utf8Value)));
+    Local<String> source = Local<String>::Cast(sources->Get(i));
+    regExps[i] = shared_ptr<OnigRegExp>(new OnigRegExp(OnigString(source)));


Drop the type aliases in onig-searcher.h, and use a fully qualified reference to std::shared_ptr<T> as you've done elsewhere? This file probably never should have relied on those aliases in that header.

PS. Dropping those aliases will obviously require updating the type references in other files, most notably onig-searcher.cc.

Yeah, this is my first PR on this repo, and I noticed we had a bunch of using statements in headers. I removed a couple of them in files that I touched, but I don't think I want to update the styling of the rest of the code at the moment.

Yeah, that makes perfect sense. Can always be done later-ish. Other than this one comment, I reckon this PR looks stellar ✨

Thanks for giving this a 👀 ! Always good to have more eyes on our c++ code.

maxbrunsfeld · 2017-04-19T07:09:37Z

spec/onig-scanner-spec.js

+
+      // Characters after unmatched high surrogates are not found.
+      match = scanner.findNextMatchSync(`X${String.fromCharCode(0xd83c)}X`, 1)
+      expect(match).toBeNull()


This is the one behavior that's changed because we no longer transcode to UTF8. When dealing with invalid UTF16 (specifically, an unmatched high surrogate), the code unit that follows will no longer be considered its own valid character, since it was supposed to contain the other half of the surrogate pair.

Previously, the second X would be matched. Now there is no match, because oniguruma considers the second X to be part of the invalid sequence.

I think this is a fine behavior; mainly we just want to make sure that we do something consistent.
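
To make the edge case concrete, here is a small sketch that builds the same input as the spec and inspects it in plain JavaScript. `hasUnmatchedHighSurrogate` is a hypothetical helper for this illustration. The byte dump uses Node's `Buffer`, which substitutes U+FFFD for a lone surrogate; whether the old native transcoding produced exactly these bytes is an assumption, but in either case the trailing X ends up as its own byte.

```javascript
// Same input as the spec: X, a lone high surrogate, X.
const input = `X${String.fromCharCode(0xd83c)}X`

// Hypothetical helper: reports whether any high surrogate (0xD800-0xDBFF)
// is not immediately followed by a low surrogate (0xDC00-0xDFFF).
function hasUnmatchedHighSurrogate(text) {
  for (let i = 0; i < text.length; i++) {
    const unit = text.charCodeAt(i)
    const isHigh = unit >= 0xd800 && unit <= 0xdbff
    if (!isHigh) continue
    const next = text.charCodeAt(i + 1)  // NaN at end of string
    const nextIsLow = next >= 0xdc00 && next <= 0xdfff
    if (!nextIsLow) return true
    i++                                  // skip the low half of a valid pair
  }
  return false
}

const invalid = hasUnmatchedHighSurrogate(input)

// In UTF8 (as produced by Node's Buffer) the second X is a separate byte,
// which is consistent with it having been matchable before this change.
const utf8Bytes = Array.from(Buffer.from(input, 'utf8'))
```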

nathansobo · 2017-04-19T20:02:24Z

Wow lots of great changes here. What's the continued role of OnigString?

maxbrunsfeld · 2017-04-19T22:08:06Z

Uh oh. It looks like several of Atom's grammars are relying on the fact that the input is encoded as UTF8; they use explicit UTF8 byte values in hexadecimal. 😢

Here's one example: atom/language-css#99. We could change language-css, but I grepped through the list of third-party packages and found several other usages of these byte literals.
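
A quick sketch of why byte literals tie a grammar to UTF8: the same character has different numeric values depending on the encoding, so a pattern written against UTF8 bytes does not line up with UTF16 code units. The character below is chosen arbitrarily for illustration and is not taken from any of the affected grammars.

```javascript
// 'é' (U+00E9), an arbitrary non-ASCII character used for illustration.
const char = '\u00e9'

// In UTF8 it is two bytes: 0xC3 0xA9.
const utf8Hex = Array.from(Buffer.from(char, 'utf8'))
  .map(byte => byte.toString(16))

// In UTF16 it is a single code unit: 0x00E9.
const utf16Hex = []
for (let i = 0; i < char.length; i++) {
  utf16Hex.push(char.charCodeAt(i).toString(16))
}
```

A grammar that spells out `c3 a9` as hexadecimal escapes would therefore stop matching this character once the scanner sees UTF16 input, which is the kind of breakage described above.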

I'm going to close this out for now and perform this more minimal change instead: #64.

nathansobo · 2017-04-19T22:11:00Z

😿

maxbrunsfeld added 7 commits April 18, 2017 14:08

Convert coffee-script files to JS, remove grunt

ca51fc3

Convert specs to JavaScript

b4bdb3b

Avoid converting between UTF8 and UTF16 character offsets

58abcfb

Just use oniguruma's native UTF16 support.

Add a length attribute to OnigString

35bcbb2

Don't test against ancient node versions on travis

5acfd0d

Add toString and substring methods to OnigString

aba0b92

6.2.0-0

402968b

maxbrunsfeld mentioned this pull request Apr 18, 2017

Use caching in oniguruma atom/first-mate#94

Merged

maxbrunsfeld added 2 commits April 18, 2017 16:50

Put back source property on OnigRegExp

f13fde9

Guard against non-strings passed to OnigString

1bcb852

maxbrunsfeld changed the title ~~Avoid expensive UTF8 to UTF16 character offset conversions~~ Avoid UTF8 to UTF16 character offset conversions Apr 19, 2017

thomasjo reviewed Apr 19, 2017


maxbrunsfeld commented Apr 19, 2017


6.2.0-1

c95845c

maxbrunsfeld closed this Apr 19, 2017
