Optimize regex #2172

hvacengi · 2017-11-17T03:17:14Z

Scanner.cs

Regenerate from TinyPG Update regex generation and scanning TinyPG#5

kRISC.tpg

Remove "^" anchor from EOF pattern, new "\G" anchor is automatically
generated
Eliminate all instances of opening "\b" word boundry anchors since
white space is always ignored and we always scan where the last token
left off

CPU.cs

Modify compileWatch reset and addition to check if stopwatch is
running
When we moved compiling to a separate thread, this wasn't modified to
know that the compile may cross multiple updates

YieldFinishedCompile.cs

Fixed call to stop the compile stopwatch so that it stops after any
completed compilation.

Scanner.cs * Regenerate from TinyPG KSP-KOS/TinyPG#5 kRISC.tpg * Remove "^" anchor from EOF pattern, new "\G" anchor is automatically generated * Remove "\b" word border anchor from "E" regex

kRISC.tpg * Eliminate all instances of opening "\b" word boundry anchors since white space is always ignored and we always scan where the last token left off Scanner.cs * Regenerate from TinyPG

This helped validate that the previous fixes actually improved compile times CPU.cs * Modify compileWatch reset and addition to check if stopwatch is running * When we moved compiling to a separate thread, this wasn't modified to know that the compile may cross multiple updates YieldFinishedCompile.cs * Fixed call to stop the compile stopwatch so that it stops after any completed compilation.

Dunbaratu · 2017-11-17T05:01:44Z

src/kOS.Safe/Compilation/KS/kRISC.tpg

+REBOOT       -> @"reboot\b";
+SHUTDOWN     -> @"shutdown\b";
+FOR          -> @"for\b";
+UNSET        -> @"unset\b";


What was the removal of the \b for? Did it affect performance in some way?

Matching \b at the start of the search should immediately count as a match with no real work on the part of the regex engine. "Oh this is the start of the text I was told to search through and it's a letter that's here - it matches \b automatically on that criteria alone". That should take no time really.

My decision was two fold:
A) I needed to remove \b from the definition of E because it would otherwise throw an error on the number 1e20 but not 1 e 20. I did not want to inadvertantly cause a similar issue on a pattern that I didn't recognize to have the same issue.
B) Just because the potential performance gain is very small doesn't mean we should ignore it. Particularly since we call the regular expressions many times over large files. And we've had fairly poor performance with Mono's regex engine so far, so anything that is even an incremental improvement could be notable.

Maybe I was overzealous with point A and there are some instances for which a word boundary really should be kept. I did not do extensive testing between having \b and not, so I cannot say if it dramatically improved performance to eliminate it. But If we don't need the character, it shouldn't be there. If you would prefer it remain for the sake of enforcing that the keywords remain preceded by a white space character, but it's worth noting that the \b character will match on more than just white space but also some punctuation and other non-word characters.

I'm open to adding \b back to patterns where appropriate or more clear.

According to https://www.regular-expressions.info/wordboundaries.html there are three different positions that qualify as word boundaries:

Before the first character in the string, if the first character is a word character.

After the last character in the string, if the last character is a word character.

Between two characters in the string, where one is a word character and the other is not a word character.

Which means that \b won't just say "Oh this is the start of the text I was told to search through and it's a letter that's here - it matches \b automatically on that criteria alone", but rather "Oh I'm at the beginning or end of the string literal and it's a letter that's here, this matches". If not scanning at the beginning of the string literal, it will still check the character before the current index.

That sounds like a bug in how the sci_number , and E rules were implemented, not a problem that warrants changing all the other keywords. Usually it Is wrong to preceed a keyword by a digit with no separation between. The bug is that 'e' was being treated like a keyword when it wasn't.... quite a keyword. (IMO this is why I prefer languages that require sci notation to be all run together without whitespace - it makes it be all cleanly be one token like it properly makes sense to be, and means you don't need to "use up" an identifier on a single letter 'e' that could have been left open for users to use. But too late now.)

As to the notion of not liking the removal of \b everywhere else (especially since once you used the \G flag, it really doesn't change performance anymore to have the \b since you forbade the regex matcher from looking to the left of the start pos with that \G flag), that's just my preference for leaving things where we have to mangle the rule for the sake of execution speed out of the actual kerboscript language definition itself. (i.e how I didn't like inserting the (?: into kRISC.tpg rules, preferring it to be part of the cooking done when building Scanner.cs.) But if there's a strange exception case we have to make for e, then we can't do that here because we can't just remove it from every rule automatically when building Scanner.cs when there actually is a difference. I hadn't known about the problem with e when I made my earlier comment.

So I guess in this case we have no choice but to do it in the kRISC.tpg file because we have to specify e differently from the rest of them and can't do it automatically. The e case means the difference between having the leading \b and not having it is not just a performance optimization, so it can't be done by TinyPG when it builds Scanner.cs. (With the earlier change, with (?:, that was purely a performance trick that had no effect on changing the language at all, so it could be done in TinyPG and made sense to do there instead of in the language def file. This one doesn't sound like it's the same thing so I remove my objection. Here there actually is a difference because e must be treated differently than the rest.)

I can revert the change and push unless you would rather just do it in the review

nah, I'm saying that because of the problem with e, this isn't quite a case where the optimization can be done entirely in TinyPG's code. Not every rule can be treated the same way. That means the optomization has to remain where you put it, in the .tpg file, like it or not.

hvacengi added 3 commits November 16, 2017 22:06

Regenerate Scanner

733a14c

Scanner.cs * Regenerate from TinyPG KSP-KOS/TinyPG#5 kRISC.tpg * Remove "^" anchor from EOF pattern, new "\G" anchor is automatically generated * Remove "\b" word border anchor from "E" regex

Eliminate more \b

1bf93c7

kRISC.tpg * Eliminate all instances of opening "\b" word boundry anchors since white space is always ignored and we always scan where the last token left off Scanner.cs * Regenerate from TinyPG

Dunbaratu reviewed Nov 17, 2017

View reviewed changes

Dunbaratu merged commit 48c8151 into KSP-KOS:develop Nov 27, 2017

hvacengi deleted the optimize-regex branch December 1, 2017 03:03

Dunbaratu added this to the v1.1.4.0 milestone Dec 23, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize regex #2172

Optimize regex #2172

hvacengi commented Nov 17, 2017

Dunbaratu Nov 17, 2017

hvacengi Nov 17, 2017

Dunbaratu Nov 17, 2017 •

edited

Loading

hvacengi Nov 18, 2017

Dunbaratu Nov 18, 2017

Optimize regex #2172

Optimize regex #2172

Conversation

hvacengi commented Nov 17, 2017

Dunbaratu Nov 17, 2017

Choose a reason for hiding this comment

hvacengi Nov 17, 2017

Choose a reason for hiding this comment

Dunbaratu Nov 17, 2017 • edited Loading

Choose a reason for hiding this comment

hvacengi Nov 18, 2017

Choose a reason for hiding this comment

Dunbaratu Nov 18, 2017

Choose a reason for hiding this comment

Dunbaratu Nov 17, 2017 •

edited

Loading