[Parser] Simplify the lexer interface #6319

Merged: 1 commit, Feb 20, 2024

@tlively tlively commented Feb 17, 2024

The lexer was previously an iterator over tokens, but that expressivity is not
actually used in the parser. Instead, we have `input.h` that adapts the token
iterator interface into an interface that is actually useful.

As a first step toward simplifying the lexer implementation to no longer be an
iterator over tokens, update its interface by moving the adaptation from input.h
to the lexer itself. This requires extensive changes to the lexer unit tests,
which will not have to change further when we actually simplify the lexer
implementation.
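
To illustrate the kind of interface the description is moving toward, here is a minimal sketch of a lexer that the parser queries directly ("take this if it is next") instead of iterating over tokens. `SketchLexer` and its whitespace-only tokenization are assumptions made for illustration; this is not the Binaryen lexer.

```cpp
#include <cctype>
#include <cstddef>
#include <string_view>

// Hypothetical, simplified lexer: tokens are runs of non-whitespace characters.
class SketchLexer {
  std::string_view rest;

  // Drop leading whitespace so `rest` starts at the next token, if any.
  void skipSpace() {
    while (!rest.empty() &&
           std::isspace(static_cast<unsigned char>(rest.front()))) {
      rest.remove_prefix(1);
    }
  }

public:
  explicit SketchLexer(std::string_view in) : rest(in) {}

  // True once no tokens remain.
  bool empty() {
    skipSpace();
    return rest.empty();
  }

  // Consume the next token only if it is exactly `expected`.
  // On a mismatch nothing is consumed, so the caller can try something else.
  bool takeKeyword(std::string_view expected) {
    skipSpace();
    std::size_t n = 0;
    while (n < rest.size() &&
           !std::isspace(static_cast<unsigned char>(rest[n]))) {
      ++n;
    }
    if (n == 0 || rest.substr(0, n) != expected) {
      return false;
    }
    rest.remove_prefix(n);
    return true;
  }
};
```

With an interface shaped like this, a test asserts on what a `take*` call returns rather than on the token that was produced, which is the shift in the unit tests the description mentions.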

@tlively tlively requested a review from kripken February 17, 2024 01:55

tlively commented Feb 17, 2024

Current dependencies on/for this PR:

This stack of pull requests is managed by Graphite.

EXPECT_FALSE(Lexer("18446744073709551616"sv).takeI64());

EXPECT_FALSE(Lexer("+9223372036854775807"sv).takeU64());
EXPECT_EQ(Lexer("+9223372036854775807"sv).takeI64(), INT64_MAX);
Member commented:

I am having trouble mapping these two lines to the old tests. Based on the constant, I looked at

  {
    Lexer lexer("+9223372036854775807"sv);
    ASSERT_FALSE(lexer.empty());
    Token expected{"+9223372036854775807"sv, IntTok{INT64_MAX, Pos}};
    EXPECT_EQ(*lexer, expected);
  }

But the first part looks different? Before we checked the lexer was not empty, and now we check that takeU64 is false (why is it false?)

tlively (Member Author) commented Feb 20, 2024:

takeU* can never succeed when the token starts with + or - because unsigned numbers should not have signs. takeI*, on the other hand, falls back to parsing the number as signed then reinterpreting it to be unsigned, so that succeeds.
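
The behavior described above can be sketched roughly as follows. The helper names `parseU64`, `parseS64`, and `parseI64` are hypothetical stand-ins for this illustration and handle decimal input only; they are not the actual `takeU64`/`takeI64` implementations.

```cpp
#include <charconv>
#include <cstdint>
#include <optional>
#include <string_view>
#include <system_error>

// Hypothetical: unsigned parse that rejects any explicit sign.
std::optional<uint64_t> parseU64(std::string_view s) {
  if (s.empty() || s[0] == '+' || s[0] == '-') {
    return std::nullopt;
  }
  uint64_t v = 0;
  auto [end, ec] = std::from_chars(s.data(), s.data() + s.size(), v);
  if (ec != std::errc() || end != s.data() + s.size()) {
    return std::nullopt; // not a number, trailing junk, or out of range
  }
  return v;
}

// Hypothetical: signed parse that accepts an optional leading '+' or '-'.
std::optional<int64_t> parseS64(std::string_view s) {
  if (!s.empty() && s[0] == '+') {
    // std::from_chars does not accept '+', so strip it by hand.
    s.remove_prefix(1);
    if (s.empty() || s[0] == '-') {
      return std::nullopt; // reject "+" and "+-5"
    }
  }
  int64_t v = 0;
  auto [end, ec] = std::from_chars(s.data(), s.data() + s.size(), v);
  if (ec != std::errc() || end != s.data() + s.size()) {
    return std::nullopt;
  }
  return v;
}

// Hypothetical: try unsigned first, then fall back to signed and
// reinterpret the bits as unsigned.
std::optional<uint64_t> parseI64(std::string_view s) {
  if (auto u = parseU64(s)) {
    return u;
  }
  if (auto i = parseS64(s)) {
    return static_cast<uint64_t>(*i);
  }
  return std::nullopt;
}
```

Under these assumptions, `parseU64("+9223372036854775807")` is empty because of the sign, `parseI64` on the same input yields `INT64_MAX` via the signed fallback, and `parseI64("18446744073709551616")` is empty because the value fits neither range.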

Member commented:

Thanks, and is that a new test compared to before? Or am I not reading the old code right?

tlively (Member Author) commented Feb 20, 2024:

The old test was testing what happened before takeU* or takeI* would have been called. The behavior of takeU* has not changed, but that's not what was being tested before. The successful tokenization part that was previously tested is still happening, but now it's an internal implementation detail.

Member commented:

I see, thanks!


tlively commented Feb 20, 2024

Going to go ahead and land this despite the unrelated emscripten failure.

@tlively tlively merged commit c0cdd26 into main Feb 20, 2024
14 of 15 checks passed
@tlively tlively deleted the parser-simplify-lexer-interface branch February 20, 2024 21:08
radekdoulik pushed a commit to dotnet/binaryen that referenced this pull request Jul 12, 2024
@gkdn gkdn mentioned this pull request Aug 31, 2024
2 participants