Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SequenceMatcher can chunkify suboptimally #206

Open
rogalski opened this issue Sep 23, 2021 · 13 comments
Open

SequenceMatcher can chunkify suboptimally #206

rogalski opened this issue Sep 23, 2021 · 13 comments
Labels
enhancement New feature or request help wanted Extra attention is needed question Further information is requested
Milestone

Comments

@rogalski
Copy link
Contributor

rogalski commented Sep 23, 2021

This bug is based on my experiments with darker on closed-source repo.
I am not at liberty to disclose any code. We'll have to work on recreating failure criteria on our own.

This one is hardest one to reproduce without code which triggered this exact behavior...

Particular baseline combined with particular patch triggered a case where black chunk was extremely large (~150 lines) and rather non-sensical. This caused bisect to go on and on. In result, a lot of lines not belonging to diff was reformatted as well. By introducing small heuristic via isjunk I was able to trigger smaller chunks which then got reformatted into something sensible.

We may want diving into adjusting junk heuristics in SequenceMatcher or using different matching algorithm (something that forces minimal diffs instead of human-readable diffs). Also, possibly use something like hypothesmith to try to reproduce this behavior in synthetic manner.

@akaihola
Copy link
Owner

bisect to go on and on

Like how many rounds?

Originally I had a naïve algorithm which expanded the number of extra context lines one by one. This of course caused laughable performance with large files which didn't diff well.

I thought bisection would result in a reasonable number of required rounds on any imaginable source code files, but apparently this isn't the case?

different matching algorithm (something that forces minimal diffs instead of human-readable diffs)

This is what I've been thinking might be the solution to many cases where bisection is triggered. Are you aware of any good Python implementations of such algorithms?

@akaihola akaihola added the question Further information is requested label Sep 24, 2021
@rogalski
Copy link
Contributor Author

Last time I read about that, Levenstein was supposedly best.
There is C-extension: https://pypi.org/project/Levenshtein/
I am unsure how well maintained that is or if public API is fully interchangeable.

@rogalski
Copy link
Contributor Author

Btw. I still like my heuristic to treat whitespace-only lines as junk. A lot of code in the world already use empty lines to split code into "somewhat" independent sections. Forcing chunk mechanism to break on those changes seems very intuitive to me.

@akaihola
Copy link
Owner

akaihola commented Sep 25, 2021

Sounds like your blank line heuristic would be a smaller and simpler change. Do you have a Darker patch for it?

@rogalski
Copy link
Contributor Author

matcher = SequenceMatcher(None, src.lines, dst.lines, autojunk=False)

Something along the lines below of should do the trick:

   matcher = SequenceMatcher(lambda line: all(char in string.whitespace for char in line), src.lines, dst.lines, autojunk=False)

Please note that this is not taking into consideration any performance concerns. O(1) lookup via set may be desirable.

@rogalski
Copy link
Contributor Author

Silly idea, maybe we should delegate computing diffs to git binary itself? (git diff --no-index f1.py f2.py)
We have to create some temporary files, but algorithm itself should be already quite well-tuned for code.
Also, compiled code may be faster than difflib, even with imposed overhead.

@akaihola
Copy link
Owner

I think it would be a fair approach to implement support for git diff --no-index f1.py f2.py in Darker and both compare the results on "difficult" files as well as benchmark the performance against the current implementation.

@akaihola akaihola added this to the 1.3.2 milestone Oct 5, 2021
@akaihola
Copy link
Owner

@rogalski I experimented with

matcher = SequenceMatcher(
    lambda line: all(char in string.whitespace for char in line),
    src.lines,
    dst.lines,
    autojunk=False
)

but it actually makes my results worse, not better. If two reformatted lines are separated by a blank line, and only one of them was modified by the user, both still get reformatted by Darker.

Minimal case:

import string
from difflib import SequenceMatcher

print(
    SequenceMatcher(
        lambda line: all(char in string.whitespace for char in line),
        ["old 1", "", "old 2"],
        ["new 1", "", "new 2"],
        autojunk=False
    ).get_opcodes()
)
[('replace', 0, 3, 0, 3)]

whereas

print(
    SequenceMatcher(
        None,
        ["old 1", "", "old 2"],
        ["new 1", "", "new 2"],
        autojunk=False
    ).get_opcodes()
)
[('replace', 0, 1, 0, 1), ('equal', 1, 2, 1, 2), ('replace', 2, 3, 2, 3)]

@akaihola
Copy link
Owner

akaihola commented Oct 10, 2021

delegate computing diffs to git binary itself? (git diff --no-index f1.py f2.py)

@rogalski FYI, src/darker/git_diff.py (link to commit just before removal) used to contain code for invoking git diff and parsing the result. I had to discard that approach when improving isort integration for Darker 1.0.0 (in 8a4e13e).

With --no-index (which I was unaware of until you pointed it out) we could very well use git diff again, now not only for user edits and isort results, but also for Black reformatting.

The best option would of course be to be able to get the distinct reformatted chunks from Black itself without having to diff and analyze its output. But I haven't yet looked into that. And Black hasn't yet defined a stable public API (although that hasn't stopped Darker from calling format_str(), assert_equivalent(), and find_project_root()).

@akaihola
Copy link
Owner

akaihola commented Oct 11, 2021

@rogalski, I started to experiment and see what it would take to get more information from Black in order to apply reformattings with more granularity. See #221.

Update: It does quickly get really hairy – obtaining reformattings in chunks from Black is not a silver bullet.

@akaihola akaihola modified the milestones: 1.3.2, 1.4.0, 1.4.1 Oct 28, 2021
@akaihola akaihola modified the milestones: 1.4.1, 1.4.2 Feb 10, 2022
@akaihola akaihola modified the milestones: 1.4.2, 1.5.0 Feb 18, 2022
@akaihola
Copy link
Owner

Moving this to milestone 1.5.0, there are a plenty of bugfixes and documentation improvements coming up for a 1.4.2 bugfix release already.

@akaihola akaihola modified the milestones: 1.5.0, 1.6.0 Feb 24, 2022
@akaihola
Copy link
Owner

Postponing to 1.6.0 – this one doesn't seem like a simple change.

@akaihola akaihola modified the milestones: 1.6.0, 1.6.1 Dec 19, 2022
@akaihola
Copy link
Owner

Postponing to 1.7.1.

@akaihola akaihola modified the milestones: 1.6.1, 1.7.1 Dec 25, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request help wanted Extra attention is needed question Further information is requested
Projects
Development

No branches or pull requests

2 participants