Fix gen lexer word boundary, case insensitive, and literal matching cases (plus conformance tests) #274

klondikedragon · 2022-10-26T03:12:41Z

@alecthomas - I wanted to get your feedback early on the approach of adding specialized tests to the conformance test suite. Each such test uses a special token (e.g., WBTEST:) to push into a new lexer state, which can then have different rules from the rest of the grammar. This allows testing different areas of the lexer without adding more and more rules to the Root state, where they might start interacting with each other as the number of conformance test cases grows and the rules get more complex.
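To make the prefix-token idea concrete, here is a minimal stdlib-only sketch. It is not participle's implementation: the `rule`, `token`, `lex`, and `demoStates` names are hypothetical, and the sketch only switches state where a real stateful lexer would maintain a stack with push and pop. It shows how a marker such as WBTEST: can route the rest of the input to an isolated rule set, so keyword rules under test never touch Root.

```go
package main

import (
	"fmt"
	"regexp"
)

// rule is a hypothetical, simplified stand-in for a stateful lexer rule.
type rule struct {
	name  string
	re    *regexp.Regexp // anchored with ^ so it only matches at the cursor
	enter string         // state to switch to after a match; "" means stay
}

type token struct{ typ, value string }

// lex tokenizes s starting in the "Root" state, trying rules in order.
func lex(states map[string][]rule, s string) ([]token, error) {
	state := "Root"
	var out []token
	for p := 0; p < len(s); {
		matched := false
		for _, r := range states[state] {
			loc := r.re.FindStringIndex(s[p:])
			if loc == nil || loc[1] == 0 {
				continue // no match, or an empty match that would not advance
			}
			out = append(out, token{r.name, s[p : p+loc[1]]})
			p += loc[1]
			if r.enter != "" {
				state = r.enter
			}
			matched = true
			break
		}
		if !matched {
			return nil, fmt.Errorf("no rule in state %q matches at offset %d", state, p)
		}
	}
	return out, nil
}

// demoStates keeps the keyword rule out of Root; only inputs that start
// with the WBTEST: marker ever see it.
func demoStates() map[string][]rule {
	ident := regexp.MustCompile(`^\w+`)
	space := regexp.MustCompile(`^\s+`)
	return map[string][]rule{
		"Root": {
			{"WBTest", regexp.MustCompile(`^WBTEST:`), "WBTest"},
			{"Ident", ident, ""},
			{"Whitespace", space, ""},
		},
		"WBTest": {
			{"WBKeyword", regexp.MustCompile(`^(?:SELECT|FROM)\b`), ""},
			{"Ident", ident, ""},
			{"Whitespace", space, ""},
		},
	}
}

func main() {
	toks, err := lex(demoStates(), "WBTEST:SELECT x")
	if err != nil {
		panic(err)
	}
	for _, t := range toks {
		fmt.Printf("%s %q\n", t.typ, t.value)
	}
}
```

Without the marker, the same input SELECT x lexes as a plain Ident in Root, which is the isolation the approach is after.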

Opening as a draft to get your feedback on the approach. This demonstrates how the generated lexer currently doesn't support the word boundary \b regex properly. (Will next see what might be needed to fix this in the lexer generation.)

This also tweaks TestLexerConformanceGenerated so that running this test function from the VSCode IDE no longer hangs forever. Before this change, the command that eventually got run was go test -run TestLexerConformanceGeneratedInternal -tags generated '-test.run=^TestLexerConformanceGenerated$', which confuses go and causes it to hang until Ctrl+C is pressed.
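As a rough illustration of the flag handling involved, the sketch below forwards every non-default flag from a flag set except those named in a skip list. The helper names (`propagatedArgs`, `demoFlagSet`) and the `-name=value` output format are assumptions for illustration; the actual test visits flag.CommandLine and builds its own argument list.

```go
package main

import (
	"flag"
	"fmt"
)

// propagatedArgs returns "-name=value" arguments for every flag in fs whose
// value differs from its default, except flags listed in skip.
// (Hypothetical helper; the output format is an assumption.)
func propagatedArgs(fs *flag.FlagSet, skip ...string) []string {
	skipped := make(map[string]bool, len(skip))
	for _, name := range skip {
		skipped[name] = true
	}
	var args []string
	fs.VisitAll(func(f *flag.Flag) {
		if f.Value.String() != f.DefValue && !skipped[f.Name] {
			args = append(args, fmt.Sprintf("-%s=%s", f.Name, f.Value.String()))
		}
	})
	return args
}

// demoFlagSet mimics an IDE invocation that sets both -test.run and -test.v.
func demoFlagSet() *flag.FlagSet {
	fs := flag.NewFlagSet("demo", flag.ContinueOnError)
	fs.String("test.run", "", "run only tests matching the pattern")
	fs.Bool("test.v", false, "verbose output")
	if err := fs.Parse([]string{"-test.run=^TestLexerConformanceGenerated$", "-test.v"}); err != nil {
		panic(err)
	}
	return fs
}

func main() {
	// test.run is dropped so it cannot clash with the explicit -run
	// already present in the child command line.
	fmt.Println(propagatedArgs(demoFlagSet(), "test.run"))
}
```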

klondikedragon · 2022-10-26T05:04:38Z

Upon further inspection, the previous iteration of the word boundary conformance test was failing because case insensitive matching wasn't implemented. I've updated the PR to fix case insensitive matching and add new tests for it, and changed the word boundary test so that it no longer relies on case insensitive matching.
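The case insensitive check can be sketched as a standalone function that mirrors the bounds-checked strings.EqualFold comparison visible in the review diff. The function name `matchLiteralFold` is hypothetical, and p is assumed to be a valid non-negative offset. One caveat worth keeping in mind: EqualFold applies Unicode simple case folding, so a fixed byte-length slice is only a safe comparison window when the literal's folded variants have the same encoded length (true for ASCII keywords).

```go
package main

import (
	"fmt"
	"strings"
)

// matchLiteralFold reports whether lit occurs at byte offset p in s,
// ignoring case. The p+n <= len(s) guard keeps the slice in range and
// still accepts a literal that ends exactly at the end of s.
func matchLiteralFold(s string, p int, lit string) bool {
	n := len(lit)
	return p+n <= len(s) && strings.EqualFold(s[p:p+n], lit)
}

func main() {
	fmt.Println(matchLiteralFold("select * from t", 0, "SELECT"))
	fmt.Println(matchLiteralFold("x SeLeCt", 2, "select"))
	fmt.Println(matchLiteralFold("sel", 0, "SELECT"))
}
```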

klondikedragon · 2022-10-26T06:04:22Z

Identified and fixed another edge case where literal matches were not working if they were at exactly the end of the string.
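The thread does not show the faulty condition, so the following is only an illustration of the boundary involved, not the actual fix. With a hypothetical `matchLiteral` helper, the end offset of the literal must be allowed to equal len(s); a strict less-than comparison would reject a literal that is the last thing in the input.

```go
package main

import "fmt"

// matchLiteral reports whether lit occurs at byte offset p in s.
// (Hypothetical helper; p is assumed to be non-negative.)
func matchLiteral(s string, p int, lit string) bool {
	end := p + len(lit)
	// end == len(s) is valid: the literal sits exactly at the end of s.
	return end <= len(s) && s[p:end] == lit
}

func main() {
	fmt.Println(matchLiteral("a SELECT", 2, "SELECT"))
	fmt.Println(matchLiteral("a SELEC", 2, "SELECT"))
}
```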

Also identified and fixed the issue with matching \b when it could match immediately before the current position p instead of right after. This is common in keyword patterns where you might have something like \b(SELECT|FROM|...)\b and want to ensure there is a word boundary both before and after the keyword.
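A word boundary test at an arbitrary offset has to look at the bytes on both sides of that offset. The sketch below uses hypothetical helpers (`isWordByte`, `atWordBoundary`, `matchKeyword`) and follows the ASCII definition of \b used by Go's regexp package; it is not the code the generator emits.

```go
package main

import "fmt"

// isWordByte reports whether b is an ASCII word character ([0-9A-Za-z_]).
func isWordByte(b byte) bool {
	return b == '_' || (b >= '0' && b <= '9') || (b >= 'a' && b <= 'z') || (b >= 'A' && b <= 'Z')
}

// atWordBoundary reports whether offset p in s is a \b position: exactly
// one of the byte before p and the byte at p is a word character. The
// start and end of s count as non-word.
func atWordBoundary(s string, p int) bool {
	before := p > 0 && isWordByte(s[p-1])
	after := p < len(s) && isWordByte(s[p])
	return before != after
}

// matchKeyword emulates \bKW\b at offset p: the literal must match and
// both its start and its end must be word boundaries.
func matchKeyword(s string, p int, kw string) bool {
	end := p + len(kw)
	if end > len(s) || s[p:end] != kw {
		return false
	}
	return atWordBoundary(s, p) && atWordBoundary(s, end)
}

func main() {
	fmt.Println(matchKeyword("SELECT x", 0, "SELECT"))
	fmt.Println(matchKeyword("xSELECT", 1, "SELECT"))
	fmt.Println(matchKeyword("FROMx", 0, "FROM"))
}
```

The leading check is the one that needs the byte before p: looking only at s[p] and later bytes cannot tell xSELECT apart from SELECT.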

alecthomas · 2022-10-26T06:17:38Z

cmd/participle/gen_lexer_cmd.go

-		if n == 1 {
-			fmt.Fprintf(w, "if p < len(s) && s[p] == %q {\n", re.Rune[0])
+		if re.Flags&syntax.FoldCase != 0 {
+			fmt.Fprintf(w, "if p+%d <= len(s) && strings.EqualFold(s[p:p+%d], %q) {\n", n, n, string(re.Rune))


alecthomas

Overall LGTM, just a couple of suggestions.

alecthomas · 2022-10-26T07:14:27Z

lexer/internal/conformance/conformance_test.go

@@ -18,6 +18,9 @@ var conformanceLexer = lexer.MustStateful(lexer.Rules{
 	"Root": {
 		{"String", `"`, lexer.Push("String")},
 		// {"Heredoc", `<<(\w+)`, lexer.Push("Heredoc")},
+		{"LiteralTest", `LITTEST:`, lexer.Push("LiteralTest")},


Yeah nice, I like this pattern.

Would you mind converting the existing tests to use this pattern too?

alecthomas · 2022-10-26T07:16:55Z

lexer/internal/conformance/conformance_test.go

+			{"LITKeyword", "SELECT"},
+		}},
+		{"LiteralMixed", `LITTEST:hello ONE test LIKE world`, []token{
+			{"LiteralTest", "LITTEST:"},


Maybe we could make the test harness strip these automatically? i.e. the first token iff it ends with TEST:
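A harness-side helper along those lines might look like the sketch below. The `token` type and `stripTestPrefix` name are hypothetical and only loosely modelled on the expected-token lists in the conformance test; the cleanup that was eventually merged may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// token is a hypothetical stand-in for the conformance test's token type.
type token struct{ Type, Value string }

// stripTestPrefix drops the first token iff its value ends with "TEST:",
// so expected-token lists need not repeat the state-selecting marker.
func stripTestPrefix(tokens []token) []token {
	if len(tokens) > 0 && strings.HasSuffix(tokens[0].Value, "TEST:") {
		return tokens[1:]
	}
	return tokens
}

func main() {
	toks := []token{
		{"LiteralTest", "LITTEST:"},
		{"Ident", "hello"},
	}
	fmt.Println(stripTestPrefix(toks))
}
```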

alecthomas · 2022-10-26T07:17:39Z

lexer/internal/conformance/conformance_test.go

@@ -117,7 +200,7 @@ func TestLexerConformanceGenerated(t *testing.T) {
 	args := []string{"test", "-run", "TestLexerConformanceGeneratedInternal", "-tags", "generated"}
 	// Propagate test flags.
 	flag.CommandLine.VisitAll(func(f *flag.Flag) {
-		if f.Value.String() != f.DefValue {
+		if f.Value.String() != f.DefValue && f.Name != "test.run" {


Nice catch 🤦‍♂️

klondikedragon · 2022-10-26T23:45:56Z

@alecthomas - I did some cleanup, see what you think. Thanks!

alecthomas · 2022-10-26T23:49:00Z

Awesome, thanks :)

klondikedragon added 2 commits October 25, 2022 21:04

Allow gen conformance test to be run individually

8e57a8e

Conformance test for word boundary \b

44abf1a

klondikedragon mentioned this pull request Oct 26, 2022

Streamline long term maintenance of lexer/parser code generators #264

Open

6 tasks

Fix case insensitive matching in gen lexer

1eb5ad6

Also add conformance tests for these cases

klondikedragon added 2 commits October 26, 2022 00:00

Fix match literal at end of string in gen lexer

db2014f

Also add conformance test for this case

Fix word boundary case in gen lexer

869d6aa

If \b is used at start of pattern, it could match right before the current position, rather than right after. Check both cases.

klondikedragon marked this pull request as ready for review October 26, 2022 06:04

klondikedragon changed the title ~~Conformance test for word boundary \b~~ Fix gen lexer word boundary, case insensitive, and literal matching cases (plus conformance tests) Oct 26, 2022

alecthomas reviewed Oct 26, 2022

alecthomas approved these changes Oct 26, 2022

Cleanup conformance tests based on code review

cc72e88

klondikedragon requested a review from alecthomas October 26, 2022 23:45

alecthomas approved these changes Oct 26, 2022

alecthomas merged commit fb225ea into alecthomas:master Oct 26, 2022

klondikedragon deleted the test/conformance-wordboundary branch October 28, 2022 00:21
