Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix gen lexer word boundary, case insensitive, and literal matching cases (plus conformance tests) #274

Merged

Conversation

klondikedragon
Copy link
Contributor

@alecthomas - I wanted to get your feedback early on the approach to add a specialized test to the conformance test suite. It uses a special token (e.g., WBTEST:) to push into a new lexer state that then might have different rules vs the rest of the grammar. This would allow for testing different areas of the lexer without having to add more and more rules to the Root state (which might start interacting with each other over time as the number of conformance test cases grow and the rules get more complex).

Opening as a draft to get your feedback on the approach. This demonstrates how the generated lexer currently doesn't support the word boundary \b regex properly. (Will next see what might be needed to fix this in the lexer generation.)

This also tweaks TestLexerConformanceGenerated so that if you run this test function from VSCode IDE it won't just hang forever (the command that eventually gets run before this change was go test -run TestLexerConformanceGeneratedInternal -tags generated '-test.run=^TestLexerConformanceGenerated$', which confuses go and causes it to hang forever until Ctrl+C is pressed).

Also add conformance tests for these cases
@klondikedragon
Copy link
Contributor Author

Upon further inspection, the previous iteration of the word boundary conformance test was failing because case insensitive matching wasn't implemented. Updated to fix case insensitive matching and add new tests for this case. And then changed the word boundary test to not rely on case insensitive matching.

Also add conformance test for this case
If \b is used at start of pattern, it could match right before the
current position, rather than right after. Check both cases.
@klondikedragon
Copy link
Contributor Author

Identified and fixed another edge case where literal matches were not working if they were at exactly the end of the string.

Also identified and fixed the issue with matching \b if it could match immediately before the current position p instead of right after. This is common in keyword patterns where you might have something like \b(SELECT|FROM|...)\b and want to ensure that both before and after the keyword is a word boundary.

@klondikedragon klondikedragon marked this pull request as ready for review October 26, 2022 06:04
@klondikedragon klondikedragon changed the title Conformance test for word boundary \b Fix gen lexer word boundary, case insensitive, and literal matching cases (plus conformance tests) Oct 26, 2022
if n == 1 {
fmt.Fprintf(w, "if p < len(s) && s[p] == %q {\n", re.Rune[0])
if re.Flags&syntax.FoldCase != 0 {
fmt.Fprintf(w, "if p+%d <= len(s) && strings.EqualFold(s[p:p+%d], %q) {\n", n, n, string(re.Rune))
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

Copy link
Owner

@alecthomas alecthomas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM, just a couple of suggestions.

@@ -18,6 +18,9 @@ var conformanceLexer = lexer.MustStateful(lexer.Rules{
"Root": {
{"String", `"`, lexer.Push("String")},
// {"Heredoc", `<<(\w+)`, lexer.Push("Heredoc")},
{"LiteralTest", `LITTEST:`, lexer.Push("LiteralTest")},
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah nice, I like this pattern.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you mind converting the existing tests to use this pattern too?

{"LITKeyword", "SELECT"},
}},
{"LiteralMixed", `LITTEST:hello ONE test LIKE world`, []token{
{"LiteralTest", "LITTEST:"},
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could make the test harness strip these automatically? ie. the first token iff it ends with TEST:

@@ -117,7 +200,7 @@ func TestLexerConformanceGenerated(t *testing.T) {
args := []string{"test", "-run", "TestLexerConformanceGeneratedInternal", "-tags", "generated"}
// Propagate test flags.
flag.CommandLine.VisitAll(func(f *flag.Flag) {
if f.Value.String() != f.DefValue {
if f.Value.String() != f.DefValue && f.Name != "test.run" {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch 🤦‍♂️

@klondikedragon
Copy link
Contributor Author

@alecthomas - I did some cleanup, see what you think. Thanks!

@alecthomas alecthomas merged commit fb225ea into alecthomas:master Oct 26, 2022
@alecthomas
Copy link
Owner

Awesome, thanks :)

@klondikedragon klondikedragon deleted the test/conformance-wordboundary branch October 28, 2022 00:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants