Formatter removes parts of the grammar and is not idempotent #25

matthiasbalke · 2018-12-28T18:35:28Z

I just applied a local build of my spotless antlr4 formatter on antlr/grammars-v4 several times in a row and everytime the sources changed. To me It looks like the formatter is removing stuff. So I did the same using the standalone JAR to check if it is a bug in my spotless integration, but the result was the same.

Here's what I did:

checkout https://github.com/antlr/grammars-v4
downloaded the released standalone formatter
move the jar into checkout directory
apply formatter by calling java -jar antlr4-formatter-standalone-1.1.0.jar -dir .
commit the changes
apply formatter again java -jar antlr4-formatter-standalone-1.1.0.jar -dir .
expect to have no modified files, as the sources are already formatted

But afterwards the following files where modified:

antlr4/ANTLRv4LexerPythonTarget.g4
antlr4/examples/twocomments.g4
apex/apex.g4
c/C.g4
csharp/CSharpPreprocessorParser.g4
csharp/CSharpSharwell/CSharpPreprocessorParser.g4
java8-js/Java8.JavaScriptTarget.g4
java8-ts/Java8.TypeScriptTarget.g4
java8/Java8.g4
java9/Java9.g4
javascript/JavaScriptParser.g4
php/Python/PhpLexer.g4
plsql/PlSqlParser.g4
python2-js/Python2.g4
python2/Python2.g4
python3-py/Python3.g4
rexx/RexxLexer.g4
swift3/Swift3.g4
system_verilog/SysVerilogHDL.g4
tsql/TSqlParser.g4
ucb-logo/UCBLogo.g4
z/ZLexer.g4

Taking a look at java8/Java8.g4 the first formatting removes some comments (which I think it shouldn't)

Shortened Diff:

  // §3.10.6 Escape Sequences for Character and String Literals
 
-fragment
-EscapeSequence
-	:	'\\' [btnfr"'\\]
-	|	OctalEscape
-    |   UnicodeEscape // This is not in the spec but prevents having to preprocess the input
-	;
-
-fragment
-OctalEscape
-	:	'\\' OctalDigit
-	|	'\\' OctalDigit OctalDigit
-	|	'\\' ZeroToThree OctalDigit OctalDigit
-	;
-
-fragment
-ZeroToThree
-	:	[0-3]
-	;
+fragment EscapeSequence
+   : '\\' [btnfr"'\\] | OctalEscape | UnicodeEscape
+   ;
+
+
+fragment OctalEscape
+   : '\\' OctalDigit | '\\' OctalDigit OctalDigit | '\\' ZeroToThree OctalDigit OctalDigit
+   ;
+
+
+fragment ZeroToThree
+   : [0-3]
+   ;

but the second formatting removes parts of the grammar:
Complete Diff:

diff --git a/java8/Java8.g4 b/java8/Java8.g4
index 0596456..e16ba95 100644
--- a/java8/Java8.g4
+++ b/java8/Java8.g4
@@ -2047,20 +2047,13 @@ fragment JavaLetter
    // covers all characters above 0x7F which are not a surrogate ~ [\u0000-\u007F\uD800-\uDBFF]{Character.isJavaIdentifierStart(_input.LA(-1))}?|
    // covers UTF-16 surrogate pairs encodings for U+10000 to U+10FFFF[\uD800-\uDBFF][\uDC00-\uDFFF]{Character.isJavaIdentifierStart(Character.toCodePoint((char)_input.LA(-2), (char)_input.LA(-1)))}?;

-
    fragment JavaLetterOrDigit
       : [a-zA-Z0-9$_] |
       // covers all characters above 0x7F which are not a surrogate ~ [\u0000-\u007F\uD800-\uDBFF]{Character.isJavaIdentifierPart(_input.LA(-1))}?|
       // covers UTF-16 surrogate pairs encodings for U+10000 to U+10FFFF[\uD800-\uDBFF][\uDC00-\uDFFF]{Character.isJavaIdentifierPart(Character.toCodePoint((char)_input.LA(-2), (char)_input.LA(-1)))}?;
-
       //
       // Additional symbols not defined in the lexical specification
-      //
-
-      AT
-         : '@'
-         ;
-
+      // AT

       ELLIPSIS
          : '...'

I think we should extract the non working grammar parts and write unit tests for each. This way we are testing Antlr4ParseTreeListenerImpl.java which is anyway very complex and could use some refactoring afterwards.

The text was updated successfully, but these errors were encountered:

teverett · 2018-12-28T21:13:28Z

@matthiasbalke I'm fine with this, and with PR's. I'll make an effort when I can find time.

matthiasbalke · 2018-12-28T21:59:53Z

@teverett I created a minimal unit test to reproduce the removal of inline comments (not sure if they are called like that), this seems to be the root of the described problem.

If you like take a look at the test: https://github.com/matthiasbalke/Antlr4Formatter/blob/issue-25-idempotent/antlr4-formatter/src/test/java/com/khubla/antlr4formatter/Antlr4FormatterTest.java#L29

I think it might be an easy one for you to fix. In the meantime I try to find the cause of it myself, but as I'm very new to ANTLR I 'm not sure if I find it.

Anyway it's a very interesting topic and I never meant to push you.

teverett · 2018-12-28T22:44:48Z

Ya, I'm looking it now, and it has to do with hidden comments that are to the *right* of the current context. Good times!

…

On Fri, Dec 28, 2018 at 2:59 PM Matthias Balke ***@***.***> wrote: @teverett <https://github.com/teverett> I created a minimal unit test to reproduce the removal of inline comments (not sure if they are called like that), this seems to be the root of the described problem. If you like take a look at the test: https://github.com/matthiasbalke/Antlr4Formatter/blob/issue-25-idempotent/antlr4-formatter/src/test/java/com/khubla/antlr4formatter/Antlr4FormatterTest.java#L29 I think it might be an easy one for you to fix. In the meantime I try to find the cause of it myself, but as I'm very new to ANTLR I 'm not sure if I find it. Anyway it's a very interesting topic and I never meant to push you. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#25 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABK1Zv4fBiniloFZFe2YV2MGf5Q_Gi5wks5u9pRagaJpZM4ZkHUN> .

teverett · 2018-12-29T00:37:58Z

@matthiasbalke ok i have a fix, I think, for the comments issue. Will commit tomorrow.

matthiasbalke · 2018-12-29T09:17:32Z

@teverett I'm exited. Have a good night.

matthiasbalke · 2019-01-02T17:33:28Z

@teverett I saw your code fix and updated my tests accordingly. Now the missing comments gets duplicated.

I also identified another code fragment, including comments which gets broken by the formatter and added another test for complex fragments.

Also I added the Java8 grammar as a kind of integration test for a real grammar.

matthiasbalke · 2019-05-20T12:56:18Z

Are there any updates on formatting comments correctly? I think last time I checked, the comments got duplicated.

teverett · 2019-05-20T16:50:03Z

Sorry, work has been very busy.

matthiasbalke · 2019-06-03T11:21:51Z

I would like to merge the spotless integration PR diffplug/spotless#328, what do you think? Do you think you find some time to fix the issue the next weeks, or should we merge the integration and fix the formatter afterwards?

nedtwigg · 2019-06-06T03:58:38Z

Hiya! Very excited to have Antlr support in Spotless. A rare corner case that duplicates comments is fine, and our paddedCell functionality can actually work around a bug of this sort. But if code gets deleted, that's a pretty serious bug that we can't work around at our level.

Happy to merge your PR and release to the public anytime, but I'd like to make sure there isn't a known "code-delete" bug when we do.

teverett · 2019-06-07T13:24:31Z

@matthiasbalke @nedtwigg

Work has been exceptionally busy, and I just haven't had time to dive in. I apologize for any inconvenience this has caused.

nedtwigg · 2019-06-07T21:43:31Z

No apologies! Volunteer open source ought to be fun, not an obligation, and the point of it being open is that any of us who want the feature can jump in and do it ourselves. Thanks for making and maintaining the formatter! We're excited to share your work with the spotless audience whenever somebody gets around to figuring out this issue.

teverett · 2019-06-08T17:52:53Z

@nedtwigg thanks
@matthiasbalke could you take a look now? I believe I've fixed the idempotence issue. There is still a small problem with spaces, however it's quite minor.

matthiasbalke · 2019-06-12T06:25:08Z

@teverett I'll check it out! Thanks for the effort!

matthiasbalke · 2019-06-12T08:44:12Z

@teverett I have tested the know edge cases (duplicating the comments) and it seem to be fine now.

But I have found another non idempotent formatting, which is adding spaces each formatting run. To test this I modified csharp/CSharpPreprocessorParser.g4 by running the formatter once and the reverting the buggy part. I've added this as a test case in #33.

diff --git a/src/main/antlr4/CSharpPreprocessorParser.g4 b/src/main/antlr4/CSharpPreprocessorParser.g4
index 297e5ca..bb84c19 100644
--- a/src/main/antlr4/CSharpPreprocessorParser.g4
+++ b/src/main/antlr4/CSharpPreprocessorParser.g4
@@ -25,7 +25,7 @@ private boolean allConditions() {
        return true;
 }
 }
-preprocessor_directive returns [boolean value]
+preprocessor_directive returns [ b o o l e a n   v a l u e ]
    : DEFINE CONDITIONAL_SYMBOL directive_new_line_or_sharp
    { ConditionalSymbols.add($CONDITIONAL_SYMBOL.text);
           $value = allConditions(); } # preprocessorDeclaration
@@ -61,7 +61,7 @@ directive_new_line_or_sharp
    | EOF
    ;

-preprocessor_expression returns [String value]
+preprocessor_expression returns [ S t r i n g   v a l u e ]
    : TRUE
    { $value = "true"; }
    | FALSE

@nedtwigg Maybe this could be handled by using paddedCell() for the first release?

nedtwigg · 2019-06-12T17:05:12Z

Since we don't have a known data loss bug any more, Spotless would be happy to merge and release your work (once this has been published to mavenCentral). If you run the formatter a second time, does it add more spaces again? e.g.

[String value]
[S t r i n g   v a l u e ]
[S  t  r  i  n  g     v  a  l  u  e  ]

If so, then paddedCell will automatically go with [String value]. But if it converges on [S t r i n g v a l u e ], then that's what paddedCell will go with. Either way, this isn't a data loss bug, and users can exclude specific files manually, so I'm happy to release to the Spotless audience.

matthiasbalke · 2019-06-13T05:35:54Z

@nedtwigg it keeps adding spaces each run.

@teverett My only question remaining is, does the misformated grammar part work as expected? Or will it cause errors? But as @nedtwigg said, users could exclude files using syntax like that, in case it leads to unexpected issues.

Feel free to release this version to Maven Central. As soon as it is published, I will update my PR to use it, as default version.

teverett · 2019-06-15T22:30:18Z

@matthiasbalke @nedtwigg we can't pull this quite yet. The grammar resulting from formatting the latest put isn't invalid.

nedtwigg · 2019-10-11T16:30:30Z

Hiya! Just wanted to take a shot at getting this useful piece of code into some more users hands :)

The grammar resulting from formatting the latest put isn't invalid.

Is the only known problem:

[String value]
[S t r i n g   v a l u e ]
[S  t  r  i  n  g     v  a  l  u  e  ]

If so, Spotless with paddedCell can fix it already. I think it's okay to release something that's a little broken. But it's also okay to decide that other things are more important, and let this continue to sit indefinitely :)

matthiasbalke · 2019-10-29T08:39:02Z

@nedtwigg I didn't manage to get this fixed using paddedCell(). It always kept the second variation, using one space between each char.

matthiasbalke · 2020-03-04T07:27:43Z

I'm just curious, but is anybody from the @antlr org able and willing to help out to fix this - as it seems to be the last - formatting bug before we can release the first working version?

@teverett, @parrt do you know have anybody in mind or don't mind taking a look into it yourself?
I think this would be a great addition for the antlr community.

parrt · 2020-03-04T16:51:09Z

I don't have time to look into this sorry

matthiasbalke · 2020-03-10T13:26:02Z

As nobody has time to fix this bug, I took a look at it. I don't fully understand whats going on, but found a workaround: #37

I think it's good enough for a first version, as it does not destroy embedded code anymore.
@teverett can you please take a look at it and merge it if it's ok for you?

teverett · 2020-03-10T14:01:32Z

@matthiasbalke I have been very busy and unable to find time for this. Thank-you for this. I will take a look.

workaround for broken formatting of action blocks (#25)

matthiasbalke · 2020-03-15T16:28:24Z

@teverett thx for merging. can you release a new version including this bug fix to maven central, in the near future?

teverett · 2020-03-21T18:45:25Z

Release 1.2.1 done.

…

Matthias Balke ***@***.***> March 15, 2020 at 10:28 AM @teverett <https://github.com/teverett> thx for merging. can you release a new version including this bug fix to maven central, in the near future? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#25 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAJLKZULL7RGVYWUR5DJHZTRHT63JANCNFSM4GMQOUGQ>.

-- Tom Everett tom@khubla.com <mailto:tom@khubla.com> <https://www.postbox-inc.com>

teverett · 2020-03-21T18:46:04Z

@matthiasbalke thanks again for the fix.

nedtwigg · 2020-05-06T17:58:35Z

Now that 1.2.1 is on mavencentral, can we close out this little tree of issues?

teverett · 2020-07-01T20:02:35Z

@matthiasbalke any objection to closing this issue?

matthiasbalke · 2020-07-01T20:51:27Z

@teverett no not at all. Thx

….9.1 Bump antlr.version from 4.9 to 4.9.1

matthiasbalke mentioned this issue Dec 28, 2018

[WIP] Add ANTLR4 support diffplug/spotless#328

Merged

12 tasks

matthiasbalke added a commit to matthiasbalke/Antlr4Formatter that referenced this issue Dec 28, 2018

add test to reproduce issue antlr#25

f1f8e06

matthiasbalke mentioned this issue Jan 8, 2019

version2 #31

Merged

matthiasbalke added a commit to matthiasbalke/Antlr4Formatter that referenced this issue Jan 8, 2019

add idempotence tests for real grammars (antlr#25)

cf62d23

matthiasbalke added a commit to matthiasbalke/Antlr4Formatter that referenced this issue Jan 8, 2019

compare expected and actual without trimming (antlr#25)

8392573

matthiasbalke added a commit to matthiasbalke/Antlr4Formatter that referenced this issue Jan 8, 2019

split comments from grammar parts in expected result (antlr#25)

8d09acd

matthiasbalke mentioned this issue Feb 1, 2020

wrong formatting #36

Open

matthiasbalke added a commit to matthiasbalke/Antlr4Formatter that referenced this issue Mar 10, 2020

workaround for broken formatting of action blocks (antlr#25)

6fca009

teverett added a commit that referenced this issue Mar 15, 2020

Merge pull request #37 from matthiasbalke/bugfix/issue-25

65bf659

workaround for broken formatting of action blocks (#25)

teverett closed this as completed Jul 1, 2020

teverett added a commit that referenced this issue Jan 6, 2021

Merge pull request #25 from teverett/dependabot/maven/antlr.version-4…

58b4c4c

….9.1 Bump antlr.version from 4.9 to 4.9.1

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Formatter removes parts of the grammar and is not idempotent #25

Formatter removes parts of the grammar and is not idempotent #25

matthiasbalke commented Dec 28, 2018

teverett commented Dec 28, 2018

matthiasbalke commented Dec 28, 2018

teverett commented Dec 28, 2018 via email

teverett commented Dec 29, 2018

matthiasbalke commented Dec 29, 2018

matthiasbalke commented Jan 2, 2019

matthiasbalke commented May 20, 2019

teverett commented May 20, 2019

matthiasbalke commented Jun 3, 2019

nedtwigg commented Jun 6, 2019

teverett commented Jun 7, 2019

nedtwigg commented Jun 7, 2019

teverett commented Jun 8, 2019

matthiasbalke commented Jun 12, 2019

matthiasbalke commented Jun 12, 2019

nedtwigg commented Jun 12, 2019

matthiasbalke commented Jun 13, 2019

teverett commented Jun 15, 2019

nedtwigg commented Oct 11, 2019

matthiasbalke commented Oct 29, 2019

matthiasbalke commented Mar 4, 2020

parrt commented Mar 4, 2020

matthiasbalke commented Mar 10, 2020

teverett commented Mar 10, 2020

matthiasbalke commented Mar 15, 2020

teverett commented Mar 21, 2020 via email

teverett commented Mar 21, 2020

nedtwigg commented May 6, 2020

teverett commented Jul 1, 2020

matthiasbalke commented Jul 1, 2020

Formatter removes parts of the grammar and is not idempotent #25

Formatter removes parts of the grammar and is not idempotent #25

Comments

matthiasbalke commented Dec 28, 2018

teverett commented Dec 28, 2018

matthiasbalke commented Dec 28, 2018

teverett commented Dec 28, 2018 via email

teverett commented Dec 29, 2018

matthiasbalke commented Dec 29, 2018

matthiasbalke commented Jan 2, 2019

matthiasbalke commented May 20, 2019

teverett commented May 20, 2019

matthiasbalke commented Jun 3, 2019

nedtwigg commented Jun 6, 2019

teverett commented Jun 7, 2019

nedtwigg commented Jun 7, 2019

teverett commented Jun 8, 2019

matthiasbalke commented Jun 12, 2019

matthiasbalke commented Jun 12, 2019

nedtwigg commented Jun 12, 2019

matthiasbalke commented Jun 13, 2019

teverett commented Jun 15, 2019

nedtwigg commented Oct 11, 2019

matthiasbalke commented Oct 29, 2019

matthiasbalke commented Mar 4, 2020

parrt commented Mar 4, 2020

matthiasbalke commented Mar 10, 2020

teverett commented Mar 10, 2020

matthiasbalke commented Mar 15, 2020

teverett commented Mar 21, 2020 via email

teverett commented Mar 21, 2020

nedtwigg commented May 6, 2020

teverett commented Jul 1, 2020

matthiasbalke commented Jul 1, 2020