Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make whitespace (' ', \t, \r, \n) always visible for "changed" lines #485

Merged
merged 4 commits into from
Nov 12, 2019

Conversation

vlsi
Copy link
Contributor

@vlsi vlsi commented Nov 8, 2019

context lines, added-only lines, and removed-only lines are shown as usual in the diffs.

fixes #465

Please make sure that your PR allows edits from maintainers. Sometimes its faster for us to just fix something than it is to describe how to fix it.

Allow edits from maintainers

After creating the PR, please add a commit that adds a bullet-point under the -SNAPSHOT section of CHANGES.md, plugin-gradle/CHANGES.md, and plugin-maven/CHANGES.md which includes:

  • a summary of the change
  • either
    • a link to the issue you are resolving (for small changes)
    • a link to the PR you just created (for big changes likely to have discussion)

If your change only affects a build plugin, and not the lib, then you only need to update the CHANGES.md for that plugin.

If your change affects lib in an end-user-visible way (fixing a bug, updating a version) then you need to update CHANGES.md for both the lib and the build plugins. Users of a build plugin shouldn't have to refer to lib to see changes that affect them.

This makes it easier for the maintainers to quickly release your changes :)

" u 2",
" u 3",
" -⇥·leading·space␊",
" -trailing·space⇥·␊",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nedtwigg , what do you think?

@vlsi
Copy link
Contributor Author

vlsi commented Nov 8, 2019

For now, I make all the whitespace visible for REPLACE-type diff fragments.
In other words, when the lines are "just added" or "just removed", then the whitespaces do not matter much, and I do not make them visible.

However, when lines are changed (for any reason), then I make all the whitespace visible just in case.

Later it would make sense to add per-line diff, then the formatter could visualize the whitespace only in case there are whitespace issues.

Copy link
Member

@nedtwigg nedtwigg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good, but it adds a bug related to charset encoding

In the old approach, this was how charsets were managed within the diff:

  • we take in Strings (unicode with unspecified internal binary representation)
  • we convert them to the UTF8-encoded binary, which allows to look for the '\n' byte as a line-break, and also allows us to put UTF8-encoded binary into the diff
  • we convert the UTF8-encoded diff into a Unicode String, which can then be converted to whatever binary encoding is required by the console

In the new approach, the Strings are converted to their native, on-disk binary encoding. The problem with that is we can no longer rely on the byte \n as a line-break. For example, in one of UTF-16BE vs LE, the '\n' will be in the middle of a codepoint.

If we return to the charset handling of the previous version, then this LGTM.

I remember this comment about bad encodings and the console. It's important to remember that both the console and the disk are binary interfaces, they don't care about Strings. For the console or a file to be a meaningful String, you have to encode the String to a binary representation, and there's nothing that keeps some random file to have the same encoding as some random shell displaying stdout.

It is possible there is an encoding bug somewhere in Spotless' error message reporting, but it isn't in any of the code changed by this particular PR.

@vlsi
Copy link
Contributor Author

vlsi commented Nov 9, 2019

Are you sure everything could be converted to UTF-8?
I guess there might be encodings that can't be converted to UTF-8 in a lossless way.

@vlsi
Copy link
Contributor Author

vlsi commented Nov 10, 2019

ok. DiffMessageFormatter is used for reporting purposes only, so it should not harm even in case some data is lost when converting file bytes to UTF-8.

I've added a guard to ensure all the fancy characters are convertible to Charset.defaultCharset().

In practice, "middle dot" is convertible to ISO-8859-1 just fine.

@nedtwigg
Copy link
Member

even in case some data is lost when converting file bytes to UTF-8.

A java.lang.String is a list of unicode code points, so you can always encode it into any UTF-X with no loss.

I've added a guard to ensure all the fancy characters are convertible

I love it! Useful and clever. I'm gonna push a couple API changes and then merge.

@nedtwigg
Copy link
Member

Ahh, sorry. No changelog entry yet, so I'm gonna push the API change back onto you ;-) Once these are all checked, I'll merge and publish a release:

  • all three changelogs
  • the API change described below
OLD: String diffWhitespaceLineEndings(String dirty, String clean, ...
NEW: String visualizeDiff(byte[] rawBytes, byte[] formattedBytes)

My quibble is with the String to byte[] conversion. The reason for the change was a teensy-tiny performance improvement. The downside is that we're now passing byte[] arrays which just so happen to expect to be UTF-8, but it's more fragile and less readable than if they were the semantic String. Let's ditch the optimization and go back to passing String around.

@vlsi
Copy link
Contributor Author

vlsi commented Nov 11, 2019

please review.

OLD: String diffWhitespaceLineEndings(String dirty, String clean, ...
NEW: String visualizeDiff(byte[] rawBytes, byte[] formattedBytes)

In my opinion having Strings does not help since jgit requires RawText which nowhere near String.

@vlsi
Copy link
Contributor Author

vlsi commented Nov 11, 2019

Note: OLD was diffWhitespaceLineEndings (11 lines) + visibleWhitespaceLineEndings (5 lines), so Strings made sense. However the new diffWhitespaceLineEndings is 6 lines, so Strings there hardly make any difference

@nedtwigg
Copy link
Member

Looks great, thanks!

JGit and byte[] are implementation details. It's fine for implementation details to end up in the API of private methods, but if its easy to hide them, why not!

@nedtwigg nedtwigg merged commit 1f46f14 into diffplug:master Nov 12, 2019
@nedtwigg
Copy link
Member

Thanks for the improvement! For future reference:

  • Union merge didn't seem helpful, the changelog ended up triplicated
  • Make sure to update CHANGES, plugin-maven/CHANGES, and plugin-gradle/CHANGES so that each group of stakeholders has one place to go.
  • When responding to PR feedback, it's easier to review if you push new commits rather than force-pushing a whole new branch. If you care about your PR's having clean history I'd be happy to allow you to rewrite the history at the end, but your PR's will be the only clean ones around ;-)

@nedtwigg
Copy link
Member

Released in x.26.0

@vlsi
Copy link
Contributor Author

vlsi commented Nov 12, 2019

It looks like middle dots are not safe:

GitHub Actions / Windows
https://github.com/apache/calcite/runs/299436342#step:4:287

          -�*�Type�factory�that�can�register�Java�classes�as�record�types.���\r\n
          +�*�Type�factory�that�can�register�Java�classes�as�record�types.\r\n

Travis:

https://travis-ci.org/apache/calcite/jobs/610889368#L450

          -·*·Type·factory·that·can·register·Java·classes·as·record·types.···␊
          +·*·Type·factory·that·can·register·Java·classes·as·record·types.␊

Then there's a bug that diff shows too many lines of context. It combines everything to a single diff fragment, while it should show three individual ones.

Apparently current.getEndA() - next.getBeginA() should flipped like next.getBeginA() - current.getEndA() in https://github.com/diffplug/spotless/pull/485/files#diff-59aaefa932a3b07e42c9a8bc5e32980fR142-R143

I'm afraid I'm not confident to submit a fix, so please feel free to revert the whole thing.
Have a good day 🌻

@nedtwigg
Copy link
Member

Reverted in 0868063 (but I kept the .editorconfig change ;-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Visualize CR and LF symbols
2 participants