SequenceMatcher can chunkify suboptimally #206
Like how many rounds? Originally I had a naïve algorithm which expanded the number of extra context lines one by one. This of course caused laughable performance with large files which didn't diff well. I thought bisection would result in a reasonable number of required rounds for any imaginable source code file, but apparently this isn't the case?
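To make the rounds comparison concrete, here is a minimal, generic sketch of bisecting a context size. This is not Darker's actual implementation; `is_acceptable` is a hypothetical monotone predicate standing in for "the diff opcodes are usable with this much extra context".

```python
def minimal_context_lines(is_acceptable, max_lines=2**16):
    """Find the smallest context size accepted by `is_acceptable` via bisection.

    `is_acceptable` is assumed to be monotone: once it returns True for some
    number of extra context lines, it stays True for larger numbers. Compared
    to growing the context one line at a time, this needs O(log n) probes
    instead of O(n).
    """
    # Exponential search for an upper bound first.
    hi = 1
    while not is_acceptable(hi):
        hi *= 2
        if hi > max_lines:
            raise RuntimeError("no acceptable context size found")
    # Standard binary search for the first accepted value in [0, hi].
    lo = 0
    while lo < hi:
        mid = (lo + hi) // 2
        if is_acceptable(mid):
            hi = mid
        else:
            lo = mid + 1
    return lo
```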
This is what I've been thinking might be the solution to many cases where bisection is triggered. Are you aware of any good Python implementations of such algorithms?
Last time I read about that, Levenshtein was supposedly best.
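For reference, the textbook Levenshtein (minimal edit distance) computation over sequences of lines looks roughly like the sketch below. It is illustrative only and says nothing about how chunk boundaries would then be chosen.

```python
def line_edit_distance(a: list[str], b: list[str]) -> int:
    """Minimal number of line insertions, deletions and replacements turning a into b.

    Classic dynamic-programming Levenshtein distance:
    O(len(a) * len(b)) time, O(len(b)) memory.
    """
    prev = list(range(len(b) + 1))
    for i, line_a in enumerate(a, start=1):
        curr = [i]
        for j, line_b in enumerate(b, start=1):
            cost = 0 if line_a == line_b else 1
            curr.append(min(
                prev[j] + 1,         # delete line_a
                curr[j - 1] + 1,     # insert line_b
                prev[j - 1] + cost,  # keep or replace
            ))
        prev = curr
    return prev[-1]
```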
Btw, I still like my heuristic to treat whitespace-only lines as junk. A lot of code in the world already uses empty lines to split code into "somewhat" independent sections. Forcing the chunking mechanism to break at those lines seems very intuitive to me.
Sounds like your blank line heuristic would be a smaller and simpler change. Do you have a Darker patch for it?
Line 92 in ac4f2b0
Something along the lines below should do the trick:
Please note that this does not take any performance concerns into consideration; an O(1) lookup via a set may be desirable.
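A rough sketch of such a whitespace-only junk heuristic is shown below. It is illustrative only, not the exact patch discussed above; the helper names and the `src_lines`/`dst_lines` parameters are made up here.

```python
import string
from difflib import SequenceMatcher

# A set gives O(1) per-character membership lookup, addressing the
# performance note above.
WHITESPACE = frozenset(string.whitespace)


def is_blank(line: str) -> bool:
    """Treat empty and whitespace-only lines as junk for matching purposes."""
    return all(char in WHITESPACE for char in line)


def make_matcher(src_lines: list[str], dst_lines: list[str]) -> SequenceMatcher:
    """Build a SequenceMatcher that treats blank lines as junk."""
    return SequenceMatcher(is_blank, src_lines, dst_lines, autojunk=False)
```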
Silly idea, maybe we should delegate computing diffs to
I think it would be a fair approach to implement support for
@rogalski I experimented with

```python
matcher = SequenceMatcher(
    lambda line: all(char in string.whitespace for char in line),
    src.lines,
    dst.lines,
    autojunk=False
)
```

but it actually makes my results worse, not better. If two reformatted lines are separated by a blank line, and only one of them was modified by the user, both still get reformatted by Darker. Minimal case:

```python
import string
from difflib import SequenceMatcher

print(
    SequenceMatcher(
        lambda line: all(char in string.whitespace for char in line),
        ["old 1", "", "old 2"],
        ["new 1", "", "new 2"],
        autojunk=False
    ).get_opcodes()
)
```

which prints

```
[('replace', 0, 3, 0, 3)]
```

whereas

```python
print(
    SequenceMatcher(
        None,
        ["old 1", "", "old 2"],
        ["new 1", "", "new 2"],
        autojunk=False
    ).get_opcodes()
)
```

prints

```
[('replace', 0, 1, 0, 1), ('equal', 1, 2, 1, 2), ('replace', 2, 3, 2, 3)]
```
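This merging appears to follow from how difflib handles junk (my reading of its documented behavior, not something stated in this thread): junk elements are excluded from the index used to find matching blocks, so a junk blank line cannot anchor an `equal` region on its own. A small check of `get_matching_blocks()` on the same data makes that visible:

```python
import string
from difflib import SequenceMatcher


def is_blank(line: str) -> bool:
    return all(char in string.whitespace for char in line)


old = ["old 1", "", "old 2"]
new = ["new 1", "", "new 2"]

# With the junk heuristic, the blank line is excluded from matching entirely,
# so no non-empty matching block is found and everything lands in one chunk:
print(SequenceMatcher(is_blank, old, new, autojunk=False).get_matching_blocks())

# Without it, the blank line anchors a one-line match, splitting the diff
# into two independent 'replace' chunks:
print(SequenceMatcher(None, old, new, autojunk=False).get_matching_blocks())
```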
@rogalski FYI. The best option would of course be to be able to get the distinct reformatted chunks from Black itself without having to diff and analyze its output. But I haven't yet looked into that. And Black hasn't yet defined a stable public API (although that hasn't stopped Darker from calling it).
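For context on what calling Black directly can look like, here is a generic example of Black's Python API for reformatting a source string. This is not a claim about which internal entry point Darker actually uses, and Black does not declare this API stable.

```python
import black

source = "x = {  'a':1 }\n"
# format_str() reformats a whole source string; Mode() carries the formatting options.
formatted = black.format_str(source, mode=black.Mode(line_length=88))
print(formatted)  # -> x = {"a": 1}
```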
Moving this to milestone 1.5.0; there are plenty of bugfixes and documentation improvements coming up for a 1.4.2 bugfix release already.
Postponing to 1.6.0 – this one doesn't seem like a simple change.
Postponing to 1.7.1.
This bug is based on my experiments with Darker on a closed-source repo.
I am not at liberty to disclose any code, so we'll have to work on recreating the failure criteria on our own.
This one is the hardest to reproduce without the code which triggered this exact behavior...
A particular baseline combined with a particular patch triggered a case where the Black chunk was extremely large (~150 lines) and rather nonsensical. This caused the bisection to go on and on. As a result, a lot of lines not belonging to the diff were reformatted as well. By introducing a small heuristic via `isjunk` I was able to trigger smaller chunks which then got reformatted into something sensible.

We may want to dive into adjusting the junk heuristics in SequenceMatcher or using a different matching algorithm (something that forces minimal diffs instead of human-readable diffs). Also, we could possibly use something like hypothesmith to try to reproduce this behavior in a synthetic manner.
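As a starting point for the synthetic-reproduction idea, a property-based sketch along these lines might help hunt for pathological chunkings. It assumes `black`, `hypothesis`, and `hypothesmith` are installed; the 150-line threshold merely mirrors the chunk size reported above and is otherwise arbitrary.

```python
from difflib import SequenceMatcher

import black
from hypothesis import given, settings
from hypothesmith import from_grammar


@settings(max_examples=200, deadline=None)
@given(source=from_grammar())
def test_no_huge_replace_chunks(source: str) -> None:
    """Search for inputs where diffing original vs. Black-formatted code yields one giant chunk."""
    try:
        formatted = black.format_str(source, mode=black.Mode())
    except black.InvalidInput:
        return  # skip sources Black cannot parse
    opcodes = SequenceMatcher(
        None, source.splitlines(), formatted.splitlines(), autojunk=False
    ).get_opcodes()
    replace_sizes = [i2 - i1 for tag, i1, i2, _j1, _j2 in opcodes if tag == "replace"]
    # Flag anything resembling the ~150-line nonsensical chunk described above.
    assert all(size < 150 for size in replace_sizes)
```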