Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize regex #2172

Merged
merged 3 commits into from
Nov 27, 2017
Merged

Optimize regex #2172

merged 3 commits into from
Nov 27, 2017

Conversation

hvacengi
Copy link
Member

Scanner.cs

kRISC.tpg

  • Remove "^" anchor from EOF pattern, new "\G" anchor is automatically
    generated
  • Eliminate all instances of opening "\b" word boundry anchors since
    white space is always ignored and we always scan where the last token
    left off

CPU.cs

  • Modify compileWatch reset and addition to check if stopwatch is
    running
  • When we moved compiling to a separate thread, this wasn't modified to
    know that the compile may cross multiple updates

YieldFinishedCompile.cs

  • Fixed call to stop the compile stopwatch so that it stops after any
    completed compilation.

Scanner.cs
* Regenerate from TinyPG KSP-KOS/TinyPG#5

kRISC.tpg
* Remove "^" anchor from EOF pattern, new "\G" anchor is automatically
generated
* Remove "\b" word border anchor from "E" regex
kRISC.tpg
* Eliminate all instances of opening "\b" word boundry anchors since
white space is always ignored and we always scan where the last token
left off

Scanner.cs
* Regenerate from TinyPG
This helped validate that the previous fixes actually improved compile
times

CPU.cs
* Modify compileWatch reset and addition to check if stopwatch is
running
* When we moved compiling to a separate thread, this wasn't modified to
know that the compile may cross multiple updates

YieldFinishedCompile.cs
* Fixed call to stop the compile stopwatch so that it stops after any
completed compilation.
REBOOT -> @"reboot\b";
SHUTDOWN -> @"shutdown\b";
FOR -> @"for\b";
UNSET -> @"unset\b";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What was the removal of the \b for? Did it affect performance in some way?

Matching \b at the start of the search should immediately count as a match with no real work on the part of the regex engine. "Oh this is the start of the text I was told to search through and it's a letter that's here - it matches \b automatically on that criteria alone". That should take no time really.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My decision was two fold:
A) I needed to remove \b from the definition of E because it would otherwise throw an error on the number 1e20 but not 1 e 20. I did not want to inadvertantly cause a similar issue on a pattern that I didn't recognize to have the same issue.
B) Just because the potential performance gain is very small doesn't mean we should ignore it. Particularly since we call the regular expressions many times over large files. And we've had fairly poor performance with Mono's regex engine so far, so anything that is even an incremental improvement could be notable.

Maybe I was overzealous with point A and there are some instances for which a word boundary really should be kept. I did not do extensive testing between having \b and not, so I cannot say if it dramatically improved performance to eliminate it. But If we don't need the character, it shouldn't be there. If you would prefer it remain for the sake of enforcing that the keywords remain preceded by a white space character, but it's worth noting that the \b character will match on more than just white space but also some punctuation and other non-word characters.

I'm open to adding \b back to patterns where appropriate or more clear.

According to https://www.regular-expressions.info/wordboundaries.html there are three different positions that qualify as word boundaries:

  • Before the first character in the string, if the first character is a word character.
  • After the last character in the string, if the last character is a word character.
  • Between two characters in the string, where one is a word character and the other is not a word character.

Which means that \b won't just say "Oh this is the start of the text I was told to search through and it's a letter that's here - it matches \b automatically on that criteria alone", but rather "Oh I'm at the beginning or end of the string literal and it's a letter that's here, this matches". If not scanning at the beginning of the string literal, it will still check the character before the current index.

Copy link
Member

@Dunbaratu Dunbaratu Nov 17, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds like a bug in how the sci_number , and E rules were implemented, not a problem that warrants changing all the other keywords. Usually it Is wrong to preceed a keyword by a digit with no separation between. The bug is that 'e' was being treated like a keyword when it wasn't.... quite a keyword. (IMO this is why I prefer languages that require sci notation to be all run together without whitespace - it makes it be all cleanly be one token like it properly makes sense to be, and means you don't need to "use up" an identifier on a single letter 'e' that could have been left open for users to use. But too late now.)

As to the notion of not liking the removal of \b everywhere else (especially since once you used the \G flag, it really doesn't change performance anymore to have the \b since you forbade the regex matcher from looking to the left of the start pos with that \G flag), that's just my preference for leaving things where we have to mangle the rule for the sake of execution speed out of the actual kerboscript language definition itself. (i.e how I didn't like inserting the (?: into kRISC.tpg rules, preferring it to be part of the cooking done when building Scanner.cs.) But if there's a strange exception case we have to make for e, then we can't do that here because we can't just remove it from every rule automatically when building Scanner.cs when there actually is a difference. I hadn't known about the problem with e when I made my earlier comment.

So I guess in this case we have no choice but to do it in the kRISC.tpg file because we have to specify e differently from the rest of them and can't do it automatically. The e case means the difference between having the leading \b and not having it is not just a performance optimization, so it can't be done by TinyPG when it builds Scanner.cs. (With the earlier change, with (?:, that was purely a performance trick that had no effect on changing the language at all, so it could be done in TinyPG and made sense to do there instead of in the language def file. This one doesn't sound like it's the same thing so I remove my objection. Here there actually is a difference because e must be treated differently than the rest.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can revert the change and push unless you would rather just do it in the review

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nah, I'm saying that because of the problem with e, this isn't quite a case where the optimization can be done entirely in TinyPG's code. Not every rule can be treated the same way. That means the optomization has to remain where you put it, in the .tpg file, like it or not.

@Dunbaratu Dunbaratu merged commit 48c8151 into KSP-KOS:develop Nov 27, 2017
@hvacengi hvacengi deleted the optimize-regex branch December 1, 2017 03:03
@Dunbaratu Dunbaratu added this to the v1.1.4.0 milestone Dec 23, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants