Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

copy detection differs from git weirdly #1524

Closed
fowles opened this issue Aug 15, 2024 · 12 comments
Closed

copy detection differs from git weirdly #1524

fowles opened this issue Aug 15, 2024 · 12 comments
Labels
acknowledged an issue is accepted as shortcoming to be fixed help wanted Extra attention is needed

Comments

@fowles
Copy link
Contributor

fowles commented Aug 15, 2024

Current behavior 😯

cli/src/commands/{ => file}/chmod.rs
cli/{tests/test_cat_command.rs => src/commands/file/mod.rs}
cli/{tests/test_unsquash_command.rs => src/commands/file/print.rs}
cli/tests/{test_chmod_command.rs => test_file_chmod_command.rs}
cli/{src/commands/cat.rs => tests/test_file_print_command.rs}

Expected behavior 🤔

cli/src/commands/{ => file}/chmod.rs
create mode 100644 cli/src/commands/file/mod.rs
cli/src/commands/{cat.rs => file/print.rs
cli/tests/{test_chmod_command.rs => test_file_chmod_command.rs}
cli/tests/{test_cat_command.rs => test_file_print_command.rs}

Git behavior

Git does what I have listed under "expected behavior"

Steps to reproduce 🕹

I am sorry that I don't have a super simple repro. I only know how to cause this in a larger context.

  1. Install jj via cargo install jj-cli
  2. jj git clone --colocate https://github.com/martinvonz/jj
  3. cd jj
  4. cargo build
  5. target/debug/jj diff -r 47bd6f4aa4a7eeef8b01ce168c6c771bdfffcbd3 --summary
  6. git diff 2de73f57fc9599602e001fc6331034749b2eacb0 47bd6f4aa4a7eeef8b01ce168c6c771bdfffcbd3 --summary

The relevant code for how we use gix is https://github.com/martinvonz/jj/blob/95e8dd51ebfd61712d9fc5f6e19dd95d0026a004/lib/src/git_backend.rs#L1301

@Byron Byron added the acknowledged an issue is accepted as shortcoming to be fixed label Aug 15, 2024
@Byron Byron mentioned this issue Aug 15, 2024
58 tasks
@Byron
Copy link
Member

Byron commented Aug 15, 2024

Thanks for reporting! I'd think that it is possible to extract a test-case from the trees that show the issue.

Also, I am not surprised as Git implements a more complex algorithm to find the best possible matches. For that it keeps a list of candidates, which gitoxide does not currently do.

One cold hope that once gitoxide implements what Git does more closely, that it will get similar results.

For now I have no plans to improve this, but I should get there at some point. Contributions for this are very welcome, too.

@Byron Byron added the help wanted Extra attention is needed label Aug 15, 2024
@fowles
Copy link
Contributor Author

fowles commented Aug 15, 2024

if you help with test reduction and point me to the code involved, I could take a crack at this, but I don't know the code base well, so something to give me a jump start would be appreciated

@Byron
Copy link
Member

Byron commented Aug 16, 2024

That's a great idea and I'd appreciate you taking a stab at it. Getting a test ready should be possible to me, feel free to ping me though if it doesn't happen within the next 'days' (whatever that means 😅).

@fowles
Copy link
Contributor Author

fowles commented Aug 16, 2024

We have a few users on the jj discord mentioning that they have also seen some oddities with it selecting files for copy detection unexpectedly, so I will definitely take a crack at this after you help set up a little test that I can expand with more examples as we find them.

Byron added a commit that referenced this issue Aug 20, 2024
The pathspec list to reduce the set of files was generated with

```
git diff 47bd6f4aa4a7eeef8b01ce168c6c771bdfffcbd3~1 47bd6f4aa4a7eeef8b01ce168c6c771bdfffcbd3  --stat --no-renames --name-only
```

The assets and script to reproduce the fixture was created with:

```
cargo run -p internal-tools -- git-to-sh -c2 /Users/byron/dev/github.com/martinvonz/jj assets/jj-trackcopy-1 47bd6f4aa4a7eeef8b01ce168c6c771bdfffcbd3 CHANGELOG.md cli/src/commands/cat.rs cli/src/commands/chmod.rs cli/src/commands/file/chmod.rs cli/src/commands/file/mod.rs cli/src/commands/file/print.rs cli/src/commands/mod.rs cli/tests/cli-reference@.md.snap cli/tests/runner.rs cli/tests/test_acls.rs cli/tests/test_cat_command.rs cli/tests/test_chmod_command.rs cli/tests/test_diffedit_command.rs cli/tests/test_file_chmod_command.rs cli/tests/test_file_print_command.rs cli/tests/test_fix_command.rs cli/tests/test_global_opts.rs cli/tests/test_immutable_commits.rs cli/tests/test_move_command.rs cli/tests/test_new_command.rs cli/tests/test_squash_command.rs cli/tests/test_unsquash_command.rs
```

It's notable that such issues are currently expected as the tracker
implementation isn't as precise as the one in Git, which tracks more
than one candidate.
Byron added a commit that referenced this issue Aug 20, 2024
The pathspec list to reduce the set of files was generated with

```
git diff 47bd6f4aa4a7eeef8b01ce168c6c771bdfffcbd3~1 47bd6f4aa4a7eeef8b01ce168c6c771bdfffcbd3  --stat --no-renames --name-only
```

The assets and script to reproduce the fixture was created with:

```
cargo run -p internal-tools -- git-to-sh -c2 /Users/byron/dev/github.com/martinvonz/jj assets/jj-trackcopy-1 47bd6f4aa4a7eeef8b01ce168c6c771bdfffcbd3 CHANGELOG.md cli/src/commands/cat.rs cli/src/commands/chmod.rs cli/src/commands/file/chmod.rs cli/src/commands/file/mod.rs cli/src/commands/file/print.rs cli/src/commands/mod.rs cli/tests/cli-reference@.md.snap cli/tests/runner.rs cli/tests/test_acls.rs cli/tests/test_cat_command.rs cli/tests/test_chmod_command.rs cli/tests/test_diffedit_command.rs cli/tests/test_file_chmod_command.rs cli/tests/test_file_print_command.rs cli/tests/test_fix_command.rs cli/tests/test_global_opts.rs cli/tests/test_immutable_commits.rs cli/tests/test_move_command.rs cli/tests/test_new_command.rs cli/tests/test_squash_command.rs cli/tests/test_unsquash_command.rs
```

It's notable that such issues are currently expected as the tracker
implementation isn't as precise as the one in Git, which tracks more
than one candidate.
@Byron
Copy link
Member

Byron commented Aug 20, 2024

Sorry for the wait, but once #1529 is merged there is a test for the exact situation described here. Tooling was also added to allow reproducing any real-world scenario easily. With that you should be able to do whatever is needed to get similar results as Git.

Something notably absent is benchmarks, but if you are interested, you could probably set one up based on the new fixture for realistic performance in this scenario.

All the best - please feel free to let me know if you need anything else.

Byron added a commit that referenced this issue Aug 20, 2024
The pathspec list to reduce the set of files was generated with

```
git diff 47bd6f4aa4a7eeef8b01ce168c6c771bdfffcbd3~1 47bd6f4aa4a7eeef8b01ce168c6c771bdfffcbd3  --stat --no-renames --name-only
```

The assets and script to reproduce the fixture was created with:

```
cargo run -p internal-tools -- git-to-sh -c2 /Users/byron/dev/github.com/martinvonz/jj assets/jj-trackcopy-1 47bd6f4aa4a7eeef8b01ce168c6c771bdfffcbd3 CHANGELOG.md cli/src/commands/cat.rs cli/src/commands/chmod.rs cli/src/commands/file/chmod.rs cli/src/commands/file/mod.rs cli/src/commands/file/print.rs cli/src/commands/mod.rs cli/tests/cli-reference@.md.snap cli/tests/runner.rs cli/tests/test_acls.rs cli/tests/test_cat_command.rs cli/tests/test_chmod_command.rs cli/tests/test_diffedit_command.rs cli/tests/test_file_chmod_command.rs cli/tests/test_file_print_command.rs cli/tests/test_fix_command.rs cli/tests/test_global_opts.rs cli/tests/test_immutable_commits.rs cli/tests/test_move_command.rs cli/tests/test_new_command.rs cli/tests/test_squash_command.rs cli/tests/test_unsquash_command.rs
```

It's notable that such issues are currently expected as the tracker
implementation isn't as precise as the one in Git, which tracks more
than one candidate.
Byron added a commit that referenced this issue Aug 20, 2024
The pathspec list to reduce the set of files was generated with

```
git diff 47bd6f4aa4a7eeef8b01ce168c6c771bdfffcbd3~1 47bd6f4aa4a7eeef8b01ce168c6c771bdfffcbd3  --stat --no-renames --name-only
```

The assets and script to reproduce the fixture was created with:

```
cargo run -p internal-tools -- git-to-sh -c2 /Users/byron/dev/github.com/martinvonz/jj assets/jj-trackcopy-1 47bd6f4aa4a7eeef8b01ce168c6c771bdfffcbd3 CHANGELOG.md cli/src/commands/cat.rs cli/src/commands/chmod.rs cli/src/commands/file/chmod.rs cli/src/commands/file/mod.rs cli/src/commands/file/print.rs cli/src/commands/mod.rs cli/tests/cli-reference@.md.snap cli/tests/runner.rs cli/tests/test_acls.rs cli/tests/test_cat_command.rs cli/tests/test_chmod_command.rs cli/tests/test_diffedit_command.rs cli/tests/test_file_chmod_command.rs cli/tests/test_file_print_command.rs cli/tests/test_fix_command.rs cli/tests/test_global_opts.rs cli/tests/test_immutable_commits.rs cli/tests/test_move_command.rs cli/tests/test_new_command.rs cli/tests/test_squash_command.rs cli/tests/test_unsquash_command.rs
```

It's notable that such issues are currently expected as the tracker
implementation isn't as precise as the one in Git, which tracks more
than one candidate.
@fowles
Copy link
Contributor Author

fowles commented Aug 21, 2024

awesome! I will take a look at this later in the week

@fowles
Copy link
Contributor Author

fowles commented Aug 22, 2024

having started debugging a bit, I am pretty sure the similarity computation is what is going wrong here.

https://github.com/Byron/gitoxide/blob/242fedc973c56b6c1b6f150af99dda972a67f547/gix-diff/src/rewrites/tracker.rs#L524

In the provided example "cli/src/commands/file/mod.rs" and "cli/src/commands/cat.rs" are reported to have a 0.9977956 similarity which seems very wrong if you look at the file contents.

@Byron
Copy link
Member

Byron commented Aug 22, 2024

Thanks for sharing, that could be the spot! I still remember that I 'tuned' it to provide the same results as Git in some specific examples, so it might have been over-fitted there.
With that said, ideally the new version of it will still produce the same results as before for these examples, as these are (at least supposed to be) modelled after a Git-given baseline.

I also wonder if even with that fixed one can obtain the same results without also adding the additional heuristics and candidates to find the best possible match.

@fowles
Copy link
Contributor Author

fowles commented Aug 22, 2024

@fowles
Copy link
Contributor Author

fowles commented Aug 22, 2024

that was it. Sending a PR shortly

@fowles
Copy link
Contributor Author

fowles commented Aug 22, 2024

#1535

@fowles
Copy link
Contributor Author

fowles commented Aug 22, 2024

resolving this until we find any more issues!

@fowles fowles closed this as completed Aug 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
acknowledged an issue is accepted as shortcoming to be fixed help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

2 participants