-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
A case of parser backtracking not being supported #138
Comments
Ah, the weakness of naive recursive-descent parsing is starting to bleed through. When the parser comes to a decision, it simply goes with the path that consumes the most input and will go with that decision greedily, never questioning that decision ever again. Unfortunately I can't tell you a nice workaround yet, this would require changing the parsing strategies (which is planned) to something that can either backtrack, or keep nondeterministic state (something like GLR). The good news is that you can still work around it manually. I've added a patch to be able to define a custom parser function. This was planned anyway, so it was just a matter of exposing the feature. Since it's highly undocumented, I've implemented your exact case here: https://github.com/LanguageDev/Yoakke/blob/master/Sources/SynKit/Tests/Parser.Tests/Issue138Tests.cs#L64 The gist of custom parsers is:
The custom parser still calls into existing rule parsers, so you can still reuse working parser rules. I've transformed your grammar slightly to be able to backtrack it a bit easier. Essentially
becomes
The custom parser consumes as many type specifiers as possible, saving each offset in the token stream. After that we go backwards and try to parse a declarator from the furthest offset that yields a success. Thanks for reporting this issue! This is sadly something I've suspected and should have provided a proper solution long ago (let this be a GLR backend or whatever). Let me know if you have any more questions. I know this is far from an ideal or pretty solution, I'll try to get in a better parser eventually. Please do leave a feedback if this solved your issue! Note: I'm pushing out an early nightly just in case, but it will take some time for NuGet to detect it. |
@LPeter1997, thank you so much for your help. I was able to follow your example, though in a real C parser it got much more… messy. Roughly all occurrences of the word "HACK" in these commits are related to the problem. It is caused by the fact that these Yet I very much appreciate the solution and the update to the library which allowed to declare custom parsers for certain nodes; it has unblocked me in my endeavour. I am leaving the issue open for the future when the underlying parser generator in Yoakke will be rewritten and the problem will go away, but I understand it may be a major change which I shouldn't expect to see soon. As usual, it was a pleasure to work with you and get a detailed – and working! – workaround for my problem. |
Hey just a heads up, a new backend for both the lexer and parser are in progress that will hopefully fix most issues. There are a couple of real-world regexes that fail to parse so the new lexer will be somewhat PCRE compatible and the parser will actually work with raw CFGs in the background, lifting a couple of annoyances. One that affects you would be the optional left-recursion (#121), the other would be (hopefully) this one. The new backend is still kind of early in progress, but will be able to generate parsers with multiple strategies (like the classic LR approaches). |
Disclaimer
Once again, mind that I never had any formal education in language parsing, so I may ask of strange / unrealisting things, and please feel free to correct me in anything I say.
Describe the bug
Let's consider this EBNF sample, an excerpt of the actual C17 grammar with everything unrelated stripped.
So, a
function_definition
may be boiled down to a sequence oftype_specifier
s followed by a singledirect_declarator
(which is aIdentifier
). So far, so good.This synthetic sample should pass through this parser:
It has one
type_specifier
, namelyint
, and then adeclarator
which isdirect_declarator
which ismain
.Unfortunately, I wasn't able to make Yoakke to parse this sample.
Here's my program:
For convenience, here's also a
.csproj
:I expect that this program should print the following:
Instead, it prints this:
Analysis
I have thoroughly read and, I believe, understood the generated code, and I believe that the following is happening.
Yoakke eagerly eats the
declaration_specifiers
collection, and eats both tokens:int
andmain
as its items. Then it tries to parse adeclarator
, but isn't able to do so, because it's out of tokens already.Then, it's unable to drop the latest item from the
declaration_specifiers
and retry thedeclarator
, though it would be the winning strategy here.Unfortunately, I don't know a workaround for this problem, so I would appreciate any feedback. If there's any hacky/ugly way to fix this, I'd love to hear it. Obviously, I would love to hear if there's an elegant solution to this problem, too :)
Environment
The text was updated successfully, but these errors were encountered: