Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize lock file format for git merge conflicts #7070

Merged
merged 1 commit into from
Jul 23, 2019

Conversation

alexcrichton
Copy link
Member

This commit is an attempt to refine Cargo's lock file format to generate
less git merge conflicts for lock file updates as well as make it easier
to manage lock file updates. The new lock file format has a few major changes:

  • The [metadata] table is no longer used to track checksums. The
    packages themselves now list checksum fields directly.

  • The entries in the dependencies key no longer unconditionally
    mention the version/source of the dependency. When unambiguous only
    the name or only the name/version are mentioned.

As mentioned before the goal here is to reduce git merge conflict
likelihood between two cargo updates to a lock file. By not using
[metadata] the updates to package checksums should only happen on the
package itself, not in a shared metadata table where it's easy to
conflict with other updates. Additionally by making dependencies
entries shorter it's hoped that updating a crate will only either add
entries to a lock file or update lines related to just that package.
It's theorized that the amount of updates necessary to a lock file are
far less than today where the version has to be updated in many
locations.

As with all lock file format changes this new format is not turned on by
default. Support is added so Cargo will preserve it if it sees it (and
tests are added too), and then some time down the road we can flip the
switch and turn this on by default.

@rust-highfive
Copy link

r? @ehuss

(rust_highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Jun 25, 2019
@alexcrichton
Copy link
Member Author

Ah meant to do this earlier, but here's a diff of the old vs new format for Cargo's own Cargo.lock

@ehuss
Copy link
Contributor

ehuss commented Jun 25, 2019

Nice!

Do you think this will be solid enough that eventually rust-lang/rust can remove the -merge git attribute on Cargo.lock?

Is there any way to test this from the command-line locally? I couldn't see a way to force it to generate a new lock file.

@alexcrichton
Copy link
Member Author

I'd hope so yeah! I wanted to get this up as an idea before I completely forgot about it and it's something I've been meaning to do for a long time. A perfect analysis of this PR would scrape github for a set of Cargo.lock changes over time. We could then compare the diff to a diff as-if the lock file were encoded with this proposed encoding. In theory we could even objectively conclude that this produces small diffs and infer that the likelihood of a merge conflict goes down.

For testing locally there's no configuration to generate the new format, just preservation of the new format if it's found. If you edit a Cargo.lock though to remove the version/source information in one of the dependencies entries (just one, no need to do the entire lock file) then Cargo will think it's the new format and change it. For example with Cargo I changed memchr 2.2.0 (registry+https://github.com/rust-lang/crates.io-index) to memchr and then ./target/debug/cargo fetch generated a new lock file.

To be clear as well, I'm in no rush to land this. I wouldn't mind taking more time to evaluate this (or rather finding time to properly evaluate this). Additionally it's probably worthwhile to at least try to brainstorm better encodings than the one I randomly invented here, as there's surely better things we could do too!

@lukaslueg
Copy link
Contributor

I probably not the only one with code that depends on the lockfile's format. Maybe the lockfile could at some point get a version-field so the need for format detection does not ripple downstream?

@alexcrichton
Copy link
Member Author

Ok so I've done a bit of initial analysis of this. I walked the entire history of the rust-lang/rust repository, found all commits that changed Cargo.lock, recorded the before/after files, and then regenerated the lock files using this new algorithm. I then compared the diff (as calculated by diff -u CLI tool) of the old format (what's actually in the history) and the new format (what's proposed here). The statistics I got were:

  • V2 format generates 22% smaller diffs on average, with a 29% standard deviation
  • V2 format diff was 5x larger in one commit. This commit removed a duplicate version of lazy_static, so the V1 format deleted one entry where the V2 format had to update all the lazy_static 0.2.x dependencies to lazy_static
  • V2 format diff can be up to 85% smaller. In this commit the version of getopts was updated and the V2 diff was very small since the update was localized.

A historgram of the new diff size over the old diff size as a percent looks like

chart

Looks like there's a huge spike of basically creating the same sized diff. That seems to happen a lot in rust-lang/rust due to additions/removals of dependencies. The lock file is so big that it's mostly just dependency edges getting removed and the checksum section doesn't change much. Additionally rust-lang/rust has a lot of path dependencies which aren't listed with a checksum.


Now all of this is only a proxy really for the actual metric we care about here, which is reducing the likelihood that two modifications to Cargo.lock will conflict according to git. I'm not actually sure how to measure that directly in an easy-ish fashion. I could try though to maybe test a few by hand whenever a PR conflicts in rust-lang/rust and see if it works. @ehuss do you have thoughts on perhaps more data you'd like to see?

@ehuss
Copy link
Contributor

ehuss commented Jul 10, 2019

How much of a concern is it that after a clean git merge, the resulting Cargo.lock is not "complete". That is, if you run with --locked it will fail?

Here's an example. A workspace with members a, b, and c. a has a dependency on foo 1.0.0. In a branch, c adds a dependency to foo 1.0.0. Independently on master, b adds a dependency on foo 2.0.0. When c merges to master, it will merge cleanly. However its dependency is still listed as just foo. This will fail with --locked because it should be listed as foo 1.0.0.

@ehuss
Copy link
Contributor

ehuss commented Jul 10, 2019

The same thing happens when there are multiple versions of a package. Say there are dependencies on foo 1.0.0 and foo 2.0.0. A branch adds a dependency to foo 2.0.0. Then on master, foo is updated to 2.0.1. When the branch merges, it still references foo 2.0.0 when it should reference 2.0.1.

These should still be functional, but will fail with --locked.

I'm wondering if this will be pretty rare, at least on repos the size of rust-lang/rust? When a rust-lang/rust PR attempts to be tested with bors, it should fail within a couple minutes so it may not be too painful?

@ehuss
Copy link
Contributor

ehuss commented Jul 10, 2019

FWIW, I simulated a bunch of Cargo.lock merge conflicts I've received on rust-lang/rust over the past year using the V2 format. None of them failed, and they are all valid and pass with --locked.

@alexcrichton
Copy link
Member Author

How much of a concern is it that after a clean git merge, the resulting Cargo.lock is not "complete". That is, if you run with --locked it will fail?

An excellent question! I would personally say that this is not a concern, but with nuance.

In general I think we want Cargo.lock to get out of the way as much as possible in most workflows, in general being entirely automatically managed. I would take this to mean that Cargo should as aggressively as possible try to recover from a git merge creating a bad lock file.

An example of this is that historically Cargo would return a parse error for a lock file that was semantically valid. For example if you had a dependency edge pointing to a nonexistent package Cargo would fail parsing that (because Cargo would never itself produce that). We ended up landing a PR though that basically swallowed the errors. The goal with it was to take as much valid information as possible from the lock file but if it doesn't all match up then we'll re-resolve anyway.

There's a separate issue in this repository for we should literally handle lockfiles with git merge markers (think >>>>>>>>>>>>>> and friends). I actually think it'd be pretty cool to do this, and the basic idea would be that we would largely automatically handle git merge conflicts (to the best extent that we could). I think there's existing prior art for this in package managers like yarn. (IIRC)

Overall I see this as a general extension of Cargo's "incremental updates" feature. If you get a bad git merge then Cargo should still accept the lock file and just automatically update it, but when doing so we shouldn't shake up the entire world. Instead we'll just try to extract as much that's valid as we can and progress with that.

Does that make sense and sound reasonable?

I'm wondering if this will be pretty rare, at least on repos the size of rust-lang/rust? When a rust-lang/rust PR attempts to be tested with bors, it should fail within a couple minutes so it may not be too painful?

I suspect it'll definitely come up over time, but yeah the failure should happen pretty quickly since it shows up as soon as you try to build anything basically.

FWIW, I simulated a bunch of Cargo.lock merge conflicts I've received on rust-lang/rust over the past year using the V2 format. None of them failed, and they are all valid and pass with --locked.

To clarify, were these git merge conflicts reported by bors or by git? I think we've got that binary attribute which causes any changes to be considered a git merge conflict, but git may be able to already merge some things today that bors rejects.

@ehuss
Copy link
Contributor

ehuss commented Jul 10, 2019

To clarify, were these git merge conflicts reported by bors or by git?

The initial list was from bors (using this query). Some of them are legitimate conflicts (that is, even without the -merge attribute) due to the structure of the V1 format. With the V2 format, IIRC there were no git conflicts detected.

@alexcrichton
Copy link
Member Author

Nice! And thanks for checking that!

That plus what I was seeing makes me pretty confident in this PR, so in that sense I don't have much more investigation I'd like to do :)

// have one version for this name. If we have more than one version
// for the name then it's ambiguous which one we'd use. That
// shouldn't ever actually happen but in theory bad git merges could
// produce invalid lock files, so silently ignore these cases.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't there be (loud) error when merge creates malformed lockfile?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless you've got a compelling reason to do so, we've got historical rationle for not doing this. In general Cargo.lock can be confusing for new users, especially if scary merge messages are happening. What's happening in this case is concurrent updates, and Cargo should handle that the way it does all other updates (aka edits to Cargo.toml) which is to automatically, silently, and incrementally update Cargo.lock.

@est31
Copy link
Member

est31 commented Jul 12, 2019

//! * Wait a "long time" (e.g. 6 months or so)

I can recall two prior changes of Cargo.lock format:

The second change was backwards compatible. Older rustcs would only remove it but they'd still parse the Cargo.lock. Here, there was a 12 week period.

The first change was not backwards compatible. There was a year long period. It was still too short for multiple people when the switch was eventually done.

This change is, similar to the first change, not backwards compatible. I want to discuss the following options for handling the switch:

Option A: waiting six months like suggested in the PR.

Option B: waiting one year. This would be analogous to the [root] removal. A point against is that [root] removal didn't bring a benefit while this change actually is beneficial. A point in favour is that backwards compatibility is more important, at least for the people with backwards compatibility in mind who'd be upset by shorter delay times.

Option C: using the powers of opt-ins. There seems to be a zero sum situation if the decision of the format is hardcoded into Cargo.lock between people who want backwards compatibility and people who want to avoid merge conflicts. One group wants it to be earlier, the other group wants it later, classical zero sum. The situation becomes positive sum with opt-ins. At the start, it would be opt-in for people who want to avoid merge conflicts, one year (or some other time period) later defaults switch and it's opt-in for people who want backwards compatibility. Both groups would gain most with option C: the people who don't want merge conflicts can easily switch immediately after it becomes stable (without having to patch their Cargo), while the people who want backwards compatibility can use the old format indefinitely. A strawman proposal for exposed syntax is cargo check --lock-file-format latest for opting into the new Cargo.lock format and lock-file-format = "1" in Cargo.toml for opting into the old format (number increasing for newer Cargo.lock formats).

I'd personally prefer option C over option B over option A. A variant of option C would be to tie everything to editions but I doubt that the people who don't want merge conflicts want to wait three years :).

@RalfJung
Copy link
Member

Is it correct that this will remove the [metadata] table entirely? Is there an easy replacement for checking duplicate crates in the dependencies of a project? For this I currently usually scroll over the [metadata] table, which is sorted and hence makes it fairly easy to visually spot duplicate crate names. In particular this is possible in diffs to quickly spot if a cargo update introduced crate duplication.

@bors
Copy link
Contributor

bors commented Jul 15, 2019

☔ The latest upstream changes (presumably #7127) made this pull request unmergeable. Please resolve the merge conflicts.

@alexcrichton
Copy link
Member Author

@RalfJung yes, this will remove [metadata] entirely and you'll have to scan adjacent [package] annotations or use something like grep/cargo-tree/etc.

@est31 I'll create a tracking issue for this if it lands and the roll-out strategy can be discussed there.

@joshtriplett
Copy link
Member

I agree with the earlier comment that it'd be nice to have a version field. Or, alternatively, a features field would work.

@alexcrichton
Copy link
Member Author

Ok we discussed this at the cargo triage meeting today and the conclusion was that there's no major blockers that came up. I've opened a separate issue about versioning and version numbers, which we decided is orthogonal from this PR it doesn't need to block this PR.

r? @ehuss for a final review of the bits and pieces?

@ehuss
Copy link
Contributor

ehuss commented Jul 22, 2019

@alexcrichton can you cargo +stable fmt?

Do we need to expedite the creation of a rustfmt bot? 😄

Just beware, there's an issue with nightly rustfmt treating some zalgo text differently from stable (in test_progress_status). The nightly version is more correct (it's a bug fix), but does make it awkward. I've considered changing the text to be unicode escapes, but it is hundreds of bytes long.

This commit is an attempt to refine Cargo's lock file format to generate
less git merge conflicts for lock file updates as well as make it easier
to manage lock file updates. The new lock file format has a few major changes:

* The `[metadata]` table is no longer used to track checksums. The
  packages themselves now list `checksum` fields directly.

* The entries in the `dependencies` key no longer unconditionally
  mention the version/source of the dependency. When unambiguous only
  the name or only the name/version are mentioned.

As mentioned before the goal here is to reduce git merge conflict
likelihood between two cargo updates to a lock file. By not using
`[metadata]` the updates to package checksums should only happen on the
package itself, not in a shared metadata table where it's easy to
conflict with other updates. Additionally by making `dependencies`
entries shorter it's hoped that updating a crate will only either add
entries to a lock file or update lines related to just that package.
It's theorized that the amount of updates necessary to a lock file are
far less than today where the version has to be updated in many
locations.

As with all lock file format changes this new format is not turned on by
default. Support is added so Cargo will preserve it if it sees it (and
tests are added too), and then some time down the road we can flip the
switch and turn this on by default.
yvt added a commit to yvt/Stella2 that referenced this pull request Feb 7, 2020
This new format is designed to cause less git merge conflicts for lock
file updates. See <rust-lang/cargo#7070>.
yvt added a commit to yvt/Stella2 that referenced this pull request Feb 7, 2020
This new format is designed to cause less git merge conflicts for lock
file updates. See <rust-lang/cargo#7070>.
jurre added a commit to dependabot/dependabot-core that referenced this pull request Oct 9, 2020
In rust-lang/cargo#7070 the lockfile format was
updated, instead of a top-level `[metadata]` key that holds the
checksums for dependencies, these checksums are now specified in-line
with the dependency.

Since we shell out to `cargo` for most things, and in our tests only
explicitly check for the presence of the `sha` values, this change is
fairly minimal.

I've opted to remove the checksums in most of our fixtures, as they are
not strictly needed in our tests. I could have left them in there and
things would still pass (`cargo` will silently upgrade them to the new
format), however it seemed noisy.
jurre added a commit to dependabot/dependabot-core that referenced this pull request Oct 9, 2020
In rust-lang/cargo#7070 the lockfile format was
updated, instead of a top-level `[metadata]` key that holds the
checksums for dependencies, these checksums are now specified in-line
with the dependency.

Since we shell out to `cargo` for most things, and in our tests only
explicitly check for the presence of the `sha` values, this change is
fairly minimal.

I've opted to remove the checksums in most of our fixtures, as they are
not strictly needed in our tests. I could have left them in there and
things would still pass (`cargo` will silently upgrade them to the new
format), however it seemed noisy.
@ehuss ehuss added this to the 1.38.0 milestone Feb 6, 2022
Eric-Arellano added a commit to pantsbuild/pants that referenced this pull request Aug 10, 2022
Improves upon #16420. @huonw wisely suggested we avoid using `hash` for constraints files because it makes Git diffs and merge conflicts much worse. python-poetry/poetry#496 rust-lang/cargo#7070

[ci skip-rust]
[ci skip-build-wheels]
cczona pushed a commit to cczona/pants that referenced this pull request Sep 1, 2022
…sbuild#16469)

Improves upon pantsbuild#16420. @huonw wisely suggested we avoid using `hash` for constraints files because it makes Git diffs and merge conflicts much worse. python-poetry/poetry#496 rust-lang/cargo#7070

[ci skip-rust]
[ci skip-build-wheels]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion.
Projects
None yet
Development

Successfully merging this pull request may close these issues.