Peeking lexer optimizations #263
Conversation
Welcome back :) These don't seem unreasonable, but do you have any benchmark comparisons of just lexing (vs. lexing+parsing)?
// Checkpoint wraps the mutable state of the PeekingLexer.
//
// Copying and restoring just this state is a bit faster than copying the entire PeekingLexer.
type Checkpoint struct {
In general I try to weigh up the cost of having a permanent public API that will need to be supported "forever" vs. the benefit. In this case a very marginal improvement in performance doesn't seem worth it, so I'd prefer just to copy PeekingLexer.
Also I think let's get in the macro optimisations first, then worry about micro-optimisations. The generalised downside of micro-optimisations is that they usually make the code less clear, so I'd prefer to avoid them for now.
// Clone creates a clone of this PeekingLexer at its current token.
//
// The parent and clone are completely independent.
func (p *PeekingLexer) Clone() *PeekingLexer {
It's fine to remove this, we're still in beta!
lexer/api.go (Outdated)
@@ -47,6 +47,11 @@ type Lexer interface {
	Next() (Token, error)
}

// BatchLexer consumes and returns a batch of tokens at a time. A Lexer may choose to implement this for extra speed.
type BatchLexer interface {
Doesn't seem unreasonable, but how about we put this in a separate PR? Do you have any benchmarks against a lexer before/after implementing this?
@petee-d -- there has been some good progress on getting the codegen lexer into shape: it now follows the pattern suggested by #264 to serialize the lexer definition into JSON and have a separate participle utility program take the JSON and produce the code, and some conformance tests were added along with some fixes. The generated lexer is now passing the test suite for my app (a few hundred test cases) and delivering an 8x lexing speedup with O(1) allocations. When using your experimental parser codegen branch before, I was able to see similar speedups during parsing, so I'm excited about the possibility of combining the two. :) I'm motivated to contribute and try to help get codegen for parsing across the finish line, and wanted to check in with you and @alecthomas about what might make sense as next steps. If you're able to finish rebasing the experimental parser generator branch onto current participle (lots of changes I know, and it looks like you made good progress on this PR!), I could then try to take it from there and adapt it to the suggestions in #264. I know you (and Alec!) are doing this work in your personal time, so I'm super grateful for anything you might have time to contribute and understand if that's not possible right now. Let me know and I could also try to take it from where the experimental branch is currently at too. @alecthomas - I'm curious to get your thoughts on this overall direction as well. Thanks!
Hey guys, yeah, it's been very difficult for me to get capacity to work on this in the last few months, especially since any capacity I could dedicate towards open-source projects went to a different library. But that's now getting less and less demanding and I should have time to finish this. Let's start with this PR. @alecthomas, could you sum up your recommendations as to which of these changes to keep and which to separate out (either to possibly include later, or to abandon completely)? My personal preference after reading your comments would be:
What do you think about each?
Regarding 1., I just realized I didn't include
I think when I started rebasing my changes (which initially used approach 2.), I decided to instead go for approach 3., but kinda forgot about it. Can't say I care too much about which of these approaches we ultimately pick. I lean towards 3. at the moment, I guess - it seems the most elegant, plus it has that tiny performance improvement of not copying immutable PeekingLexer fields. By the way, I finally found time for coding today and I finished resolving conflicts against the current participle master - I gave up on using git merge (the conflicts were just too massive and difficult to read) and instead manually copy-pasted the changes from the original branch diff - it was actually faster that way. :D My test suite for the Jinja template language is now passing. I'm not even close to done though, as I still need to implement support for the 2 newly added node types (custom and union).
@petee-d - it's exciting you've got some time to spend on this effort, and that you have finished an initial rebasing! Thank you! If you publish your updated branch, I could give it a try on my grammar / test suite as well. I'm not using the custom/union node types, so it feels like there is a chance it could pass the test suite in its current state. :) @alecthomas is certainly the authority, so I'm only offering my $0.02 on the other items :) Items 2, 3, and 4 feel like pretty "local" changes, so there perhaps isn't a big debate on whether or not to accept them. Item 2, avoiding an allocation in a hot path, sounds pretty valuable. For items 3 and 4 it's hard to tell if they affect the parser-level benchmark or not, but they mostly just change a few internal functions of PeekingLexer. It seems like the main open question with this PR is item 1 and whether to just copy the whole PeekingLexer struct or have this separate Checkpoint struct. A slice header is a pointer plus two integers, and a map I believe is just a pointer, so splitting out the struct saves copying maybe a few extra words vs approach 1. Is that generating any noticeable difference in macro (parser-level) benchmarks? If not, it sounds like Alec preferred the readability/maintainability of just copying the whole struct. If it does have a major macro-level effect, then perhaps it's worth the change. I'm just thinking that a few percentage points of CPU speed may not be a big thing to include (vs. the maintainability cost of another public API on PeekingLexer) if the generated parser gets a 10x-20x speedup (and the generated lexer gets 8x+ vs. reflection).
@klondikedragon I ran a benchmark (parsing only on Jinja) and the Checkpoint optimization was actually even more impactful than I remembered. Benchmarks were run with
Unsurprisingly, 2. and 3. were equal as the
I got curious and tried adding an unused map or a slice to Checkpoint; the slice actually slowed it down the most, noticeably more than adding 24 bytes of non-pointer data. Not entirely sure what goes on in the background, but maybe Go has to be more careful manipulating pointer data due to garbage collection? (there probably is a reason why there is a separate

Anyway, I wouldn't say performance is the main reason why I prefer approach 3. - I just find it more readable. The maintainability cost isn't really that high, it's just a new public type and new
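A rough way to reproduce that observation is a micro-benchmark along these lines (purely illustrative; the struct layouts and names are made up for the sketch and are not taken from the PR):

```go
package lexer_test

import "testing"

// A cursor-only checkpoint: three plain integers, no pointers.
type cursorsOnly struct {
	rawCursor, nextCursor, cursor int
}

// The same checkpoint plus a slice header (pointer + len + cap).
type withSlice struct {
	rawCursor, nextCursor, cursor int
	tokens                        []int
}

var sinkA cursorsOnly
var sinkB withSlice

func BenchmarkCopyCursorsOnly(b *testing.B) {
	src := cursorsOnly{1, 2, 3}
	for i := 0; i < b.N; i++ {
		sinkA = src // plain word-by-word copy
	}
}

func BenchmarkCopyWithSlice(b *testing.B) {
	src := withSlice{1, 2, 3, make([]int, 8)}
	for i := 0; i < b.N; i++ {
		sinkB = src // copies a pointer, so GC write barriers can be involved
	}
}
```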
@petee-d -- pretty interesting that the copy of the slice was the biggest source of slowdown in copying the struct, I wouldn't have guessed that! It makes me curious to learn more about how Go does GC. Apparently there is a build flag where it will output escape analysis info,

@alecthomas - @petee-d makes a compelling case for the "Checkpoint" style modification to
(force-pushed from c2ffb31 to fce21ae)
I pushed the rebased branch with the last commit (BatchLexer) dropped, so this is ready to go if @alecthomas agrees with the plan.
I like the checkpoint approach, works for me 👍
I prefer this approach because it makes the API more explicit.
I think I'd prefer to either switch
👍
I.e. quite significant. Changing
I have the pointer conversion done here: #283
I'll merge that in.
Thanks for doing the benchmarks BTW 🙏
The state itself is still in private fields, thus Checkpoint being a public field doesn't break encapsulation.
Roughly 5% less CPU time and 12% fewer allocations in the Thrift benchmark with generated lexer.
The attached benchmark is over 3x faster after the optimization.
(force-pushed from 2ea71d3 to e90a87f)
Forgot to comment that I updated the pull request after rebasing onto your changes to the PeekingLexer (not a pleasant conflict - I wish you had done this after merging this PR 😅). IMHO it's ready for merging.
@alecthomas Do you agree this can be merged? If so, can you?

Writing tests revealed quite a lot of differences between errors & partial parse trees reported by the reflective parser and my generated code. But I'm slowly getting through them - I fixed the vast majority by modifying the code generator and a few by improving participle itself (small changes of functionality that only affect edge cases for error reporting and should be beneficial). I have my WIP code at https://github.com/alecthomas/participle/compare/master...petee-d:parser-code-generation?expand=1 and it's not yet ready for review (there will be some refactoring of the code generator and it needs documenting once it stabilizes). But what I would like you to look at is the public API for generating parser code and using it. I documented it here, you can also see it in action in the Thrift parser. The point of the whole custom type embedding
The last line combines the lexer generated with current master and my generated parser. Lexing is the slow part here, taking around 70% of the time.
Oh and @klondikedragon, I would love to hear your feedback after trying it on your grammar as well. Especially any bugs or usability issues you notice.
Nice!
@petee-d @alecthomas - exciting progress! I've tried out the WIP parsergen code with nice results, I'll post details to #213 shortly.
Hi @alecthomas,

when rebasing #213 to master, I decided it would be best to commit my stuff in smaller chunks. The first chunk is some optimizations I did to PeekingLexer, as it was becoming the bottleneck to the generated parser implementation.

Optimizing PeekingLexer.Clone - Checkpoint

It does an allocation that could be avoided. One way is to just have parseContext and my generated code store it by value (keeping the pointer receiver for its methods, though). This already helps quite a lot, but I went a bit further in my generated code - wrapping the internal cursors in a public lexer.Checkpoint type. When a branch needs a backup of the PeekingLexer state, it only needs to copy this small struct and restore it if the branch was not accepted - and it's even a little bit faster than copying & replacing the entire PeekingLexer struct.
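For illustration, the idea looks roughly like this - a minimal sketch with made-up field names, assuming participle's lexer.Token type, not the actual code from this PR:

```go
// Checkpoint holds only the mutable cursor state of the lexer.
type Checkpoint struct {
	rawCursor  int // index of the next raw token, including elided ones
	nextCursor int // index of the next non-elided token
	cursor     int // count of non-elided tokens consumed so far
}

// PeekingLexer embeds the mutable Checkpoint next to its immutable data.
type PeekingLexer struct {
	Checkpoint
	tokens []Token       // never mutated after Upgrade
	elide  map[rune]bool // token types to skip when peeking
}

// A generated branch then saves and restores only the small Checkpoint value:
//
//	save := lex.Checkpoint
//	if !parseBranch(lex) {
//		lex.Checkpoint = save // branch rejected, rewind
//	}
```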
Optimizing PeekingLexer.Clone - removing it

Using the Checkpoint pattern in the reflection parser is tough, as it needs to explicitly apply an accepted branch instead of reverting to a backup of a failed branch (the way that was nicer to use in the generated code). So my proposal would be to keep the Checkpoint pattern only for generated code. But I added a commit (independent of my other changes) that I can include if you like, which stores PeekingLexer in parseContext by value and avoids the Clone allocation altogether.
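A minimal sketch of this by-value alternative (the parseContext layout and helper below are illustrative, not the real participle internals):

```go
// parseContext stores the PeekingLexer by value, so branching copies a struct
// on the stack instead of allocating a clone on the heap.
type parseContext struct {
	lex PeekingLexer // by value; methods still use a pointer receiver via &c.lex
}

// tryBranch copies the lexer state, runs the branch, and only keeps the new
// state if the branch was accepted.
func (c *parseContext) tryBranch(parse func(*PeekingLexer) bool) bool {
	branch := c.lex // plain struct copy, no allocation
	if parse(&branch) {
		c.lex = branch // accept: adopt the advanced lexer state
		return true
	}
	return false // reject: c.lex is left untouched
}
```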
Optimizing PeekingLexer.Peek

Currently Peek (which is used quite often by the reflective parser and even more often by my generated parser) needs to skip elided tokens every time it's called. It also checks the length of PeekingLexer.tokens and returns p.eof in a lot of places - I think it's more elegant to just have the EOF token as the last token and prevent Next from advancing beyond it.

This optimization does this work once in Upgrade and Next and makes Peek super simple; it also removes the eof field. The performance improvement for the Thrift benchmark is barely noticeable, but a separate benchmark I added shows a >3x improvement for Peek calls; it was also more noticeable in my generated parser.

I mentioned this whole thing in point 2 in Generating parser code #213 (comment), as the initial implementation changed the behavior of RawCursor and thus also Range. This new implementation adds a separate cursor for knowing where the next non-elided token is, making it a bit slower, but it prevents a change of behavior.
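Continuing the sketch above with illustrative names, the EOF-as-last-token idea makes Peek a plain slice read and leaves the elision skipping to Next:

```go
// Peek returns the next non-elided token without consuming it. Upgrade appends
// the EOF token to the slice and calls skipElided once, so no bounds or EOF
// checks are needed here.
func (p *PeekingLexer) Peek() Token {
	return p.tokens[p.nextCursor]
}

// Next consumes the next non-elided token, never advancing past the final EOF.
func (p *PeekingLexer) Next() Token {
	t := p.tokens[p.nextCursor]
	if t.EOF() {
		return t
	}
	p.rawCursor = p.nextCursor + 1
	p.cursor++
	p.skipElided()
	return t
}

// skipElided moves nextCursor forward to the next non-elided token (or EOF).
func (p *PeekingLexer) skipElided() {
	for p.nextCursor = p.rawCursor; ; p.nextCursor++ {
		t := &p.tokens[p.nextCursor]
		if t.EOF() || !p.elide[t.Type] {
			return
		}
	}
}
```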
Working with lexer.Token as pointer

Token is not a small struct and in most cases it's more efficient to work with it as a pointer instead of copying it - assuming, of course, no allocations are needed, as the tokens are already stored in PeekingLexer.tokens. Changing Token methods to a pointer receiver alone improved CPU perf in the Thrift benchmark by ~2.5%; trying to make Next, Peek and other methods return it as a pointer made the improvement ~4%. Such a

Considering this would be a breaking change to a public (somewhat internal, but still public) API, I decided it wasn't worth it. But if you think it's OK, I still added a new method Current, which is a version of Peek that returns a pointer. BenchmarkPeekingLexer_Current still shows it's almost 3x faster than Peek. If you don't like it, I don't need to include it. I also used pointers at least inside the PeekingLexer methods and got a tiny CPU improvement in the Thrift benchmark out of it.
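As a sketch of that non-breaking variant (the signature below is illustrative), Current hands out a pointer into the existing token slice so hot loops avoid copying the whole Token struct:

```go
// Current returns a pointer to the next non-elided token without copying it.
// Callers must not modify the pointee.
func (p *PeekingLexer) Current() *Token {
	return &p.tokens[p.nextCursor]
}

// A hot loop in a generated parser could then look like:
//
//	for tok := lex.Current(); !tok.EOF(); tok = lex.Current() {
//		// match tok.Type / tok.Value without copying the Token struct
//		lex.Next()
//	}
```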
BatchLexer interface

This is actually unrelated to the code generation, but when playing with a manually written lexer for Jinja, I found it both more convenient and more efficient (thanks to a much lower number of function calls and more efficient allocations) to have the ability to yield multiple (in my case up to 50) tokens at a time. The interface and the related implementation are just a soft suggestion; if you don't like it, I can remove it. But it's optional - if a Lexer chooses to implement it, it can gain extra performance; if it doesn't, no problem.

After all but the last optimization (BatchLexer isn't used by any built-in lexer - yet), the final comparison looks like:

Perhaps even more important is that, according to a profiler, it now spends almost 7x less time in the PeekingLexer methods which I intended to optimize.
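To make the batching suggestion concrete, such an interface could look something like this - a sketch only, where the method name and contract are my illustration rather than the PR's actual definition:

```go
// BatchLexer is an optional interface a Lexer can implement to hand over
// several tokens per call, cutting per-token call overhead.
type BatchLexer interface {
	Lexer
	// NextBatch returns one or more tokens; the batch containing the EOF
	// token is the last one.
	NextBatch() ([]Token, error)
}

// readAllTokens shows how a consumer such as PeekingLexer's Upgrade could
// prefer the batched path and fall back to plain Next otherwise.
func readAllTokens(l Lexer) ([]Token, error) {
	out := make([]Token, 0, 64)
	if bl, ok := l.(BatchLexer); ok {
		for {
			batch, err := bl.NextBatch()
			if err != nil {
				return nil, err
			}
			out = append(out, batch...)
			if len(batch) > 0 && batch[len(batch)-1].EOF() {
				return out, nil
			}
		}
	}
	for {
		t, err := l.Next()
		if err != nil {
			return nil, err
		}
		out = append(out, t)
		if t.EOF() {
			return out, nil
		}
	}
}
```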