Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A case of parser backtracking not being supported #138

Open
ForNeVeR opened this issue Feb 13, 2022 · 3 comments
Open

A case of parser backtracking not being supported #138

ForNeVeR opened this issue Feb 13, 2022 · 3 comments
Labels
bug Something isn't working

Comments

@ForNeVeR
Copy link

ForNeVeR commented Feb 13, 2022

Disclaimer

Once again, mind that I never had any formal education in language parsing, so I may ask of strange / unrealisting things, and please feel free to correct me in anything I say.

Describe the bug

Let's consider this EBNF sample, an excerpt of the actual C17 grammar with everything unrelated stripped.

function_definition: declaration_specifiers declarator
declaration_specifiers: type_specifier declaration_specifiers?
declarator: direct_declarator
direct_declarator: Identifier
type_specifier: 'int'
type_specifier: Identifier

So, a function_definition may be boiled down to a sequence of type_specifiers followed by a single direct_declarator (which is a Identifier). So far, so good.

This synthetic sample should pass through this parser:

int main

It has one type_specifier, namely int, and then a declarator which is direct_declarator which is main.

Unfortunately, I wasn't able to make Yoakke to parse this sample.

Here's my program:

using Yoakke.SynKit.C.Syntax;
using Yoakke.SynKit.Lexer;
using Yoakke.SynKit.Parser.Attributes;

namespace Foo;

public record FunctionDefinition(List<IDeclarationSpecifier> Specifiers, Declarator Declarator);

public interface IDeclarationSpecifier
{
}

public record Declarator(DirectDeclarator DirectDeclarator);

public record DirectDeclarator(string Text);

public record TypeSpecifier(string Name) : IDeclarationSpecifier;

[Parser(typeof(CTokenType))]
public partial class CParser
{
    [Rule("function_definition: declaration_specifiers declarator")]
    private static FunctionDefinition MakeFunctionDefinition(
        List<IDeclarationSpecifier> specifiers,
        Declarator declarator) => new(specifiers, declarator);

    [Rule("declaration_specifiers: type_specifier declaration_specifiers?")]
    private static List<IDeclarationSpecifier> MakeDeclarationSpecifiers(
        IDeclarationSpecifier typeSpecifier,
        List<IDeclarationSpecifier>? rest) =>
        rest?.Prepend(typeSpecifier).ToList() ?? new List<IDeclarationSpecifier> { typeSpecifier };
    
    [Rule("declarator: direct_declarator")]
    private static Declarator MakeDeclarator(DirectDeclarator directDeclarator) =>
        new(directDeclarator);

    [Rule("direct_declarator: Identifier")]
    private static DirectDeclarator MakeDirectDeclarator(IToken identifier) =>
        new DirectDeclarator(identifier.Text);
    
    [Rule("type_specifier: 'int'")]
    [Rule("type_specifier: Identifier")]
    private static TypeSpecifier MakeSimpleTypeSpecifier(IToken specifier) => new(specifier.Text);
}


public class Program
{
    public static void Main(string[] args)
    {
        var parser = new CParser(new CLexer("int main"));
        Console.WriteLine(parser.ParseFunctionDefinition().Ok);
    }
}

For convenience, here's also a .csproj:

<Project Sdk="Microsoft.NET.Sdk">

    <PropertyGroup>
        <OutputType>Exe</OutputType>
        <TargetFramework>net6.0</TargetFramework>
        <ImplicitUsings>enable</ImplicitUsings>
        <Nullable>enable</Nullable>
    </PropertyGroup>

    <ItemGroup>
      <PackageReference Include="Yoakke.SynKit.C.Syntax" Version="2022.1.24-2.29.33-nightly" />
      <PackageReference Include="Yoakke.SynKit.Parser" Version="2022.1.24-2.29.33-nightly" />
      <PackageReference Include="Yoakke.SynKit.Parser.Generator" Version="2022.1.24-2.29.33-nightly" />
    </ItemGroup>

</Project>

I expect that this program should print the following:

Yoakke.SynKit.Parser.ParseOk`1[Foo.FunctionDefinition]

Instead, it prints this:

Unhandled exception. System.InvalidCastException: Unable to cast object of type 'Yoakke.SynKit.Parser.ParseError' to type 'Yoakke.SynKit.Parser.ParseOk`1[Foo.FunctionDefinition]'.
   at Yoakke.SynKit.Parser.ParseResult`1.get_Ok()
   at Foo.Program.Main(String[] args) in T:\Temp\ConsoleApp4\ConsoleApp4\Program.cs:line 52

Analysis

I have thoroughly read and, I believe, understood the generated code, and I believe that the following is happening.

Yoakke eagerly eats the declaration_specifiers collection, and eats both tokens: int and main as its items. Then it tries to parse a declarator, but isn't able to do so, because it's out of tokens already.

Then, it's unable to drop the latest item from the declaration_specifiers and retry the declarator, though it would be the winning strategy here.

Unfortunately, I don't know a workaround for this problem, so I would appreciate any feedback. If there's any hacky/ugly way to fix this, I'd love to hear it. Obviously, I would love to hear if there's an elegant solution to this problem, too :)

Environment

  • OS: Windows 10
  • .NET version: .NET 6
  • Library version: 2022.1.24-2.29.33-nightly
@ForNeVeR ForNeVeR added the bug Something isn't working label Feb 13, 2022
@LPeter1997
Copy link
Contributor

Ah, the weakness of naive recursive-descent parsing is starting to bleed through. When the parser comes to a decision, it simply goes with the path that consumes the most input and will go with that decision greedily, never questioning that decision ever again.

Unfortunately I can't tell you a nice workaround yet, this would require changing the parsing strategies (which is planned) to something that can either backtrack, or keep nondeterministic state (something like GLR). The good news is that you can still work around it manually. I've added a patch to be able to define a custom parser function. This was planned anyway, so it was just a matter of exposing the feature.

Since it's highly undocumented, I've implemented your exact case here: https://github.com/LanguageDev/Yoakke/blob/master/Sources/SynKit/Tests/Parser.Tests/Issue138Tests.cs#L64

The gist of custom parsers is:

  • You annotate a method with [CustomParser("rule-name")] to tell the generator that your annotated method is a parser you'd like to implement yourself
  • The method has to be an instance method
  • The method takes an int parameter, which is the current offset in the token stream
  • The return type has to be a ParseResult<T>

The custom parser still calls into existing rule parsers, so you can still reuse working parser rules. I've transformed your grammar slightly to be able to backtrack it a bit easier. Essentially

declaration_specifiers : type_specifier declaration_specifiers?

becomes

declaration_specifiers : type_specifier+

The custom parser consumes as many type specifiers as possible, saving each offset in the token stream. After that we go backwards and try to parse a declarator from the furthest offset that yields a success.

Thanks for reporting this issue! This is sadly something I've suspected and should have provided a proper solution long ago (let this be a GLR backend or whatever). Let me know if you have any more questions. I know this is far from an ideal or pretty solution, I'll try to get in a better parser eventually. Please do leave a feedback if this solved your issue!

Note: I'm pushing out an early nightly just in case, but it will take some time for NuGet to detect it.

ForNeVeR added a commit to ForNeVeR/Cesium that referenced this issue Feb 14, 2022
@ForNeVeR
Copy link
Author

ForNeVeR commented Feb 15, 2022

@LPeter1997, thank you so much for your help.

I was able to follow your example, though in a real C parser it got much more… messy.

Roughly all occurrences of the word "HACK" in these commits are related to the problem. It is caused by the fact that these declaration_specifiers are used in several places in C structure declared by the standard, and it's hard to generalize the code (so I have to just write new custom parsers for every declaration node).

Yet I very much appreciate the solution and the update to the library which allowed to declare custom parsers for certain nodes; it has unblocked me in my endeavour.

I am leaving the issue open for the future when the underlying parser generator in Yoakke will be rewritten and the problem will go away, but I understand it may be a major change which I shouldn't expect to see soon.

As usual, it was a pleasure to work with you and get a detailed – and working! – workaround for my problem.

@LPeter1997
Copy link
Contributor

Hey just a heads up, a new backend for both the lexer and parser are in progress that will hopefully fix most issues.

There are a couple of real-world regexes that fail to parse so the new lexer will be somewhat PCRE compatible and the parser will actually work with raw CFGs in the background, lifting a couple of annoyances. One that affects you would be the optional left-recursion (#121), the other would be (hopefully) this one. The new backend is still kind of early in progress, but will be able to generate parsers with multiple strategies (like the classic LR approaches).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants