Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Formatter removes parts of the grammar and is not idempotent #25

Closed
matthiasbalke opened this issue Dec 28, 2018 · 30 comments
Closed

Formatter removes parts of the grammar and is not idempotent #25

matthiasbalke opened this issue Dec 28, 2018 · 30 comments

Comments

@matthiasbalke
Copy link
Contributor

I just applied a local build of my spotless antlr4 formatter on antlr/grammars-v4 several times in a row and everytime the sources changed. To me It looks like the formatter is removing stuff. So I did the same using the standalone JAR to check if it is a bug in my spotless integration, but the result was the same.

Here's what I did:

  1. checkout https://github.com/antlr/grammars-v4
  2. downloaded the released standalone formatter
  3. move the jar into checkout directory
  4. apply formatter by calling java -jar antlr4-formatter-standalone-1.1.0.jar -dir .
  5. commit the changes
  6. apply formatter again java -jar antlr4-formatter-standalone-1.1.0.jar -dir .
  7. expect to have no modified files, as the sources are already formatted

But afterwards the following files where modified:

  • antlr4/ANTLRv4LexerPythonTarget.g4
  • antlr4/examples/twocomments.g4
  • apex/apex.g4
  • c/C.g4
  • csharp/CSharpPreprocessorParser.g4
  • csharp/CSharpSharwell/CSharpPreprocessorParser.g4
  • java8-js/Java8.JavaScriptTarget.g4
  • java8-ts/Java8.TypeScriptTarget.g4
  • java8/Java8.g4
  • java9/Java9.g4
  • javascript/JavaScriptParser.g4
  • php/Python/PhpLexer.g4
  • plsql/PlSqlParser.g4
  • python2-js/Python2.g4
  • python2/Python2.g4
  • python3-py/Python3.g4
  • rexx/RexxLexer.g4
  • swift3/Swift3.g4
  • system_verilog/SysVerilogHDL.g4
  • tsql/TSqlParser.g4
  • ucb-logo/UCBLogo.g4
  • z/ZLexer.g4

Taking a look at java8/Java8.g4 the first formatting removes some comments (which I think it shouldn't)

Shortened Diff:

  // §3.10.6 Escape Sequences for Character and String Literals
 
-fragment
-EscapeSequence
-	:	'\\' [btnfr"'\\]
-	|	OctalEscape
-    |   UnicodeEscape // This is not in the spec but prevents having to preprocess the input
-	;
-
-fragment
-OctalEscape
-	:	'\\' OctalDigit
-	|	'\\' OctalDigit OctalDigit
-	|	'\\' ZeroToThree OctalDigit OctalDigit
-	;
-
-fragment
-ZeroToThree
-	:	[0-3]
-	;
+fragment EscapeSequence
+   : '\\' [btnfr"'\\] | OctalEscape | UnicodeEscape
+   ;
+
+
+fragment OctalEscape
+   : '\\' OctalDigit | '\\' OctalDigit OctalDigit | '\\' ZeroToThree OctalDigit OctalDigit
+   ;
+
+
+fragment ZeroToThree
+   : [0-3]
+   ;

but the second formatting removes parts of the grammar:
Complete Diff:

diff --git a/java8/Java8.g4 b/java8/Java8.g4
index 0596456..e16ba95 100644
--- a/java8/Java8.g4
+++ b/java8/Java8.g4
@@ -2047,20 +2047,13 @@ fragment JavaLetter
    // covers all characters above 0x7F which are not a surrogate ~ [\u0000-\u007F\uD800-\uDBFF]{Character.isJavaIdentifierStart(_input.LA(-1))}?|
    // covers UTF-16 surrogate pairs encodings for U+10000 to U+10FFFF[\uD800-\uDBFF][\uDC00-\uDFFF]{Character.isJavaIdentifierStart(Character.toCodePoint((char)_input.LA(-2), (char)_input.LA(-1)))}?;

-
    fragment JavaLetterOrDigit
       : [a-zA-Z0-9$_] |
       // covers all characters above 0x7F which are not a surrogate ~ [\u0000-\u007F\uD800-\uDBFF]{Character.isJavaIdentifierPart(_input.LA(-1))}?|
       // covers UTF-16 surrogate pairs encodings for U+10000 to U+10FFFF[\uD800-\uDBFF][\uDC00-\uDFFF]{Character.isJavaIdentifierPart(Character.toCodePoint((char)_input.LA(-2), (char)_input.LA(-1)))}?;
-
       //
       // Additional symbols not defined in the lexical specification
-      //
-
-      AT
-         : '@'
-         ;
-
+      // AT

       ELLIPSIS
          : '...'

I think we should extract the non working grammar parts and write unit tests for each. This way we are testing Antlr4ParseTreeListenerImpl.java which is anyway very complex and could use some refactoring afterwards.

@teverett
Copy link
Member

@matthiasbalke I'm fine with this, and with PR's. I'll make an effort when I can find time.

matthiasbalke added a commit to matthiasbalke/Antlr4Formatter that referenced this issue Dec 28, 2018
@matthiasbalke
Copy link
Contributor Author

@teverett I created a minimal unit test to reproduce the removal of inline comments (not sure if they are called like that), this seems to be the root of the described problem.

If you like take a look at the test: https://github.com/matthiasbalke/Antlr4Formatter/blob/issue-25-idempotent/antlr4-formatter/src/test/java/com/khubla/antlr4formatter/Antlr4FormatterTest.java#L29

I think it might be an easy one for you to fix. In the meantime I try to find the cause of it myself, but as I'm very new to ANTLR I 'm not sure if I find it.

Anyway it's a very interesting topic and I never meant to push you.

@teverett
Copy link
Member

teverett commented Dec 28, 2018 via email

@teverett
Copy link
Member

@matthiasbalke ok i have a fix, I think, for the comments issue. Will commit tomorrow.

@matthiasbalke
Copy link
Contributor Author

@teverett I'm exited. Have a good night.

@matthiasbalke
Copy link
Contributor Author

@teverett I saw your code fix and updated my tests accordingly. Now the missing comments gets duplicated.

I also identified another code fragment, including comments which gets broken by the formatter and added another test for complex fragments.

Also I added the Java8 grammar as a kind of integration test for a real grammar.

@matthiasbalke matthiasbalke mentioned this issue Jan 8, 2019
matthiasbalke added a commit to matthiasbalke/Antlr4Formatter that referenced this issue Jan 8, 2019
matthiasbalke added a commit to matthiasbalke/Antlr4Formatter that referenced this issue Jan 8, 2019
matthiasbalke added a commit to matthiasbalke/Antlr4Formatter that referenced this issue Jan 8, 2019
@matthiasbalke
Copy link
Contributor Author

Are there any updates on formatting comments correctly? I think last time I checked, the comments got duplicated.

@teverett
Copy link
Member

Sorry, work has been very busy.

@matthiasbalke
Copy link
Contributor Author

I would like to merge the spotless integration PR diffplug/spotless#328, what do you think? Do you think you find some time to fix the issue the next weeks, or should we merge the integration and fix the formatter afterwards?

@nedtwigg
Copy link
Contributor

nedtwigg commented Jun 6, 2019

Hiya! Very excited to have Antlr support in Spotless. A rare corner case that duplicates comments is fine, and our paddedCell functionality can actually work around a bug of this sort. But if code gets deleted, that's a pretty serious bug that we can't work around at our level.

Happy to merge your PR and release to the public anytime, but I'd like to make sure there isn't a known "code-delete" bug when we do.

@teverett
Copy link
Member

teverett commented Jun 7, 2019

@matthiasbalke @nedtwigg

Work has been exceptionally busy, and I just haven't had time to dive in. I apologize for any inconvenience this has caused.

@nedtwigg
Copy link
Contributor

nedtwigg commented Jun 7, 2019

No apologies! Volunteer open source ought to be fun, not an obligation, and the point of it being open is that any of us who want the feature can jump in and do it ourselves. Thanks for making and maintaining the formatter! We're excited to share your work with the spotless audience whenever somebody gets around to figuring out this issue.

@teverett
Copy link
Member

teverett commented Jun 8, 2019

@nedtwigg thanks
@matthiasbalke could you take a look now? I believe I've fixed the idempotence issue. There is still a small problem with spaces, however it's quite minor.

@matthiasbalke
Copy link
Contributor Author

@teverett I'll check it out! Thanks for the effort!

@matthiasbalke
Copy link
Contributor Author

@teverett I have tested the know edge cases (duplicating the comments) and it seem to be fine now.

But I have found another non idempotent formatting, which is adding spaces each formatting run. To test this I modified csharp/CSharpPreprocessorParser.g4 by running the formatter once and the reverting the buggy part. I've added this as a test case in #33.

diff --git a/src/main/antlr4/CSharpPreprocessorParser.g4 b/src/main/antlr4/CSharpPreprocessorParser.g4
index 297e5ca..bb84c19 100644
--- a/src/main/antlr4/CSharpPreprocessorParser.g4
+++ b/src/main/antlr4/CSharpPreprocessorParser.g4
@@ -25,7 +25,7 @@ private boolean allConditions() {
        return true;
 }
 }
-preprocessor_directive returns [boolean value]
+preprocessor_directive returns [ b o o l e a n   v a l u e ]
    : DEFINE CONDITIONAL_SYMBOL directive_new_line_or_sharp
    { ConditionalSymbols.add($CONDITIONAL_SYMBOL.text);
           $value = allConditions(); } # preprocessorDeclaration
@@ -61,7 +61,7 @@ directive_new_line_or_sharp
    | EOF
    ;

-preprocessor_expression returns [String value]
+preprocessor_expression returns [ S t r i n g   v a l u e ]
    : TRUE
    { $value = "true"; }
    | FALSE

@nedtwigg Maybe this could be handled by using paddedCell() for the first release?

@nedtwigg
Copy link
Contributor

Since we don't have a known data loss bug any more, Spotless would be happy to merge and release your work (once this has been published to mavenCentral). If you run the formatter a second time, does it add more spaces again? e.g.

[String value]
[S t r i n g   v a l u e ]
[S  t  r  i  n  g     v  a  l  u  e  ]

If so, then paddedCell will automatically go with [String value]. But if it converges on [S t r i n g v a l u e ], then that's what paddedCell will go with. Either way, this isn't a data loss bug, and users can exclude specific files manually, so I'm happy to release to the Spotless audience.

@matthiasbalke
Copy link
Contributor Author

@nedtwigg it keeps adding spaces each run.

@teverett My only question remaining is, does the misformated grammar part work as expected? Or will it cause errors? But as @nedtwigg said, users could exclude files using syntax like that, in case it leads to unexpected issues.

Feel free to release this version to Maven Central. As soon as it is published, I will update my PR to use it, as default version.

@teverett
Copy link
Member

@matthiasbalke @nedtwigg we can't pull this quite yet. The grammar resulting from formatting the latest put isn't invalid.

@nedtwigg
Copy link
Contributor

Hiya! Just wanted to take a shot at getting this useful piece of code into some more users hands :)

The grammar resulting from formatting the latest put isn't invalid.

Is the only known problem:

[String value]
[S t r i n g   v a l u e ]
[S  t  r  i  n  g     v  a  l  u  e  ]

If so, Spotless with paddedCell can fix it already. I think it's okay to release something that's a little broken. But it's also okay to decide that other things are more important, and let this continue to sit indefinitely :)

@matthiasbalke
Copy link
Contributor Author

@nedtwigg I didn't manage to get this fixed using paddedCell(). It always kept the second variation, using one space between each char.

@matthiasbalke
Copy link
Contributor Author

I'm just curious, but is anybody from the @antlr org able and willing to help out to fix this - as it seems to be the last - formatting bug before we can release the first working version?

@teverett, @parrt do you know have anybody in mind or don't mind taking a look into it yourself?
I think this would be a great addition for the antlr community.

@parrt
Copy link
Member

parrt commented Mar 4, 2020

I don't have time to look into this sorry

matthiasbalke added a commit to matthiasbalke/Antlr4Formatter that referenced this issue Mar 10, 2020
@matthiasbalke
Copy link
Contributor Author

As nobody has time to fix this bug, I took a look at it. I don't fully understand whats going on, but found a workaround: #37

I think it's good enough for a first version, as it does not destroy embedded code anymore.
@teverett can you please take a look at it and merge it if it's ok for you?

@teverett
Copy link
Member

@matthiasbalke I have been very busy and unable to find time for this. Thank-you for this. I will take a look.

teverett added a commit that referenced this issue Mar 15, 2020
workaround for broken formatting of action blocks (#25)
@matthiasbalke
Copy link
Contributor Author

@teverett thx for merging. can you release a new version including this bug fix to maven central, in the near future?

@teverett
Copy link
Member

teverett commented Mar 21, 2020 via email

@teverett
Copy link
Member

@matthiasbalke thanks again for the fix.

@nedtwigg
Copy link
Contributor

nedtwigg commented May 6, 2020

Now that 1.2.1 is on mavencentral, can we close out this little tree of issues?

@teverett
Copy link
Member

teverett commented Jul 1, 2020

@matthiasbalke any objection to closing this issue?

@matthiasbalke
Copy link
Contributor Author

@teverett no not at all. Thx

@teverett teverett closed this as completed Jul 1, 2020
teverett added a commit that referenced this issue Jan 6, 2021
….9.1

Bump antlr.version from 4.9 to 4.9.1
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants