Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite Cargo.toml when packaging crates #4030

Merged
merged 1 commit into from
May 18, 2017

Conversation

alexcrichton
Copy link
Member

This commit is an implementation of rewriting TOML manifests when we publish
them to the registry. The rationale for doing this is to provide a guarantee
that downloaded tarballs from crates.io can be built with cargo build
(literally). This in turn eases a number of other possible consumers of crates
from crates.io

  • Vendored sources can now be more easily modified/checked as cargo build should
    work and they're standalone crates that suffice for path dependencies
  • Tools like cargobomb/crater no longer need to edit the manifest and can
    instead perform regression testing on the literal tarballs they download
  • Other systems such as packaging Rust code may be able to take advantage of
    this, but this is a less clear benefit.

Overall I'm hesitatnt about this, unfortunately. This is a silent translation
happening on publish, a rare operation, that's difficult to inspect before it
flies up to crates.io. I wrote a script to run this transformation over all
crates.io crates and found a surprisingly large number of discrepancies. The
transformation basically just downloaded all crates at all versions,
regenerated the manifest, and then tested if the two manifests were (in memory)
the same.

Unfortunately historical Cargo had a critical bug which I think made this
exercise not too useful. Cargo used to not recreate tarballs if one already
existed, which I believe led to situations such as:

  1. cargo publish
  2. Cargo generates an error about a dependency. This could be that there's a
    version not present in a path dependency, there could be a git
    dependency, etc.
  3. Errors are fixed.
  4. cargo publish
  5. Publish is successful

In step 4 above historical Cargo would not recreate the tarball. This means
that the contents of the index (what was published) aren't guaranteed to match
with the tarball's Cargo.toml. When building from crates.io this is ok as the
index is the source of truth for dependency information, but it means that any
transformation to rewrite Cargo.toml is impossible to verify against all crates
on crates.io (due to historical bugs like these).

I strove to read as many errors as possible regardless, attempting to suss out
bugs in the implementation here. To further guard against surprises I've updated
the verification step of packaging to work "normally" in these sense that it's
not rewriting dependencies itself or changing summaries. I'm hoping that this
serves as a good last-ditch effort that what we're about to publish will indeed
build as expected when uploaded to crates.io

Overall I'm probably 70% confident in this change. I think it's necessary to
make progress, but I think there are going to be very painful bugs that arise
from this feature. I'm open to ideas to help weed out these bugs ahead of time!
I've done what I can but I fear it may not be entirely enough.

Closes #4027

@alexcrichton
Copy link
Member Author

r? @matklad

@rust-highfive
Copy link

r? @brson

(rust_highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive assigned matklad and unassigned brson May 11, 2017
@matklad
Copy link
Member

matklad commented May 11, 2017

I'll need some time to digest it, but one thought I have right now is that perhaps it would be prudent to add original manifest to archive as well? I think having unmodified manifests could help us if something goes terribly wrong.

@alexcrichton
Copy link
Member Author

That's probably not a bad idea, yeah. Something like Cargo.toml.orig perhaps?

@matklad
Copy link
Member

matklad commented May 11, 2017

That's probably not a bad idea, yeah. Something like Cargo.toml.orig perhaps?

Any name will do as long as it is mentioned in the comment of the rewritten file.

However, mechanically rewriting a file which was created manually by the user does not look super great to me. An alternative would be to store modifications in a separate file, Cargo.patch, analogous to Cargo.lock, and then merge Cargo.toml and Cargo.patch during workspace loading. This second solution seems more clean to me, but I don't like it anyway because it's more complicated and introduces additional logic :)

@alexcrichton
Copy link
Member Author

Unfortunately the key feature here is that the Cargo.toml is overridden. in the sense that a "normal" cargo build will work in the crate, and the crate otherwise operates as a normal path dependency when the source is downloaded. Now that doesn't meant that we couldn't support a "patch" format of some kind (maybe not literally a patch file), but it means that it'd become a first-class feature of Cargo that everyone could use everywhere, which I'm not sure we want to do just yet...

@matklad
Copy link
Member

matklad commented May 12, 2017

Am I correct that this rewriting should not affect anything besides the literal "cargo build downloaded crate" use case, because we are rewriting only the summary of the manifest, and we ignore summary from Cargo.toml and use the one from the index anyway?

On the one hand, this is good, because it means that rewriting Cargo.toml is unlikely to break anything significant. On the other hand, this means that this feature will be rarely tested in the wild, and some bugs may remain uncovered.

This is a silent translation happening on publish, a rare operation, that's difficult to inspect before it
flies up to crates.io

I think because publish is a rare operation, we can make it much noisier than usual. Like, we can copy all the files we are going to publish to a temp directory, together with rewritten toml, show the listing of files we are going to publish (were there some complaints that cargo accidently published stuff that wasn't supposed to be published and vice verse?), print some information about rewriting path deps to usual deps, and ask a conformation?

@matklad
Copy link
Member

matklad commented May 12, 2017

Now that doesn't meant that we couldn't support a "patch" format of some kind (maybe not literally a patch file), but it means that it'd become a first-class feature of Cargo that everyone could use everywhere, which I'm not sure we want to do just yet...

One more alternative is to add a single very specific boolean option to Cargo.toml (like, appending published = true to the end of the file), which basically says: "ignore path deps when reading this file".

@matklad
Copy link
Member

matklad commented May 12, 2017

On the other hand, this means that this feature will be rarely tested in the wild, and some bugs may remain uncovered.

Ah, I missed the part that now we verify unmodified package! That should help a lot.

Overall, I think we should definitely do this! The only thing that makes me nervous is implementaion: there's a lot of code there, and most of the code is identity function. Perhaps we can operate on untyped toml Table directly and modify only the path we are interested in? That way, we wouldn't need to change TOML frobbing code when adding new keys to the manifest.

@alexcrichton
Copy link
Member Author

Am I correct that this rewriting should not affect anything besides the literal "cargo build downloaded crate" use case

Not 100%. We do indeed ignore the Summary portion of Cargo.toml, but we still parse Cargo.toml from the registry downloads (for normal crates.io builds) to learn about targets. Stuff like "is this a proc macro?" or "where is the library entry point file?" In that sense the targets are critical to preserve, the dependencies less so for normal operation.

I think because publish is a rare operation, we can make it much noisier than usual

I definitely agree with this, we've got a lot of leeway here. I'm wary to add interactivity which might break scripts, but I think a saving grace of ours is the sanity check of "cargo build" just before you publish. The modifications here make this almost a cargo build modulo how the workspace is handled, but it should in theory be a good check that the crate at least builds.

Do you think that's enough? Or do you think we should add something else? I'm wary to say "please look at this file and make sure it's right" because it's a pretty ugly file (not a pretty toml, a fully explicit toml) so it will look alien to most users.

One more alternative is to add a single very specific boolean option to Cargo.toml (like, appending published = true to the end of the file), which basically says: "ignore path deps when reading this file".

Hm that's an interesting idea yeah. Something that basically makes it as uninvasive as possible. I'm still a little wary of doing this though as it's a new public-facing feature unlike this implementation which is entirely an internal detail (in theory).

One thought that's also been rolling around in the back of my head is that there may not always be an obvious and/or canonical desugaring for a manifest. Right now we drop [workspace] and path dependencies, but the path dependencies are specifically rewritten to crates.io and more generally the registry being published to.

Basically I'm thinking that with multi-registry support Cargo may not know via simply a boolean flag where a path dependency is supposed to point, it may or may not be crates.io.

only thing that makes me nervous is implementaion: there's a lot of code there, and most of the code is identity function. Perhaps we can operate on untyped toml Table directly and modify only the path we are interested in?

My hope here was to use the literall 100% same exact structures for serialization and deserialization. That at least provides the guarantee that we won't forget to implement any key we add support for or something like that. It definitely means it's a bit trickier, however, because a lot of the deserialization business is done mostly for ergonomics (lots of options, multi variants, etc).

The downside is definitely that it's a relatively large wad of code, and some of the translation is non-obvious. Primarily the inference from a list of Target structures back to the [lib] etc sections is pretty scary...

@alexcrichton
Copy link
Member Author

Oh I've also pushed an update which writes in Cargo.toml.orig with the exact contents of the original manifest.

@matklad
Copy link
Member

matklad commented May 15, 2017

Primarily the inference from a list of Target structures back to the [lib] etc sections is pretty scary...

Yeah. Why from_package_for_registry has a signature Package -> TomlManifest. Can we make it TomlManifest -> TomlManifest?

@alexcrichton
Copy link
Member Author

Oh I don't mind either way, we need to go from a Package to TomlManifest somewhere, and there's not really a lot of usage in having an ephemeral TomlManifest, so I figured we'd short-circuit.

@cuviper
Copy link
Member

cuviper commented May 15, 2017

Could you run into cases where the Package is less complete than the raw toml? e.g. missing a field for some forward-compatible new feature, which probably shouldn't just be dropped when an older cargo does the rewriting.

@alexcrichton
Copy link
Member Author

Yes that's true I think. Basically if you use an older Cargo to publish a newer crate, it'd drop fields it didn't know about.

The alternative, though, I see as sort of just as bad. If we just trawl through the TOML representation deleting, for example, path keys and workspace keys, then we (a) may delete too much or (b) may forget to delete something crucial. For example if we add a new [foo-dependencies] section it seems very easy to forget to delete path keys from that section, whereas today it'd be a compiler error.

Does that make sense? I think I'd slightly prefer the strategy with compiler errors and basically recommend you publish w/ recent versions of Cargo, which typically is what you're doing anyway as you likely just ran cargo test before or some time before publication.

@alexcrichton alexcrichton force-pushed the rewrite-cargo-toml branch 2 times, most recently from 7eb5aaa to 7989b15 Compare May 15, 2017 19:59
@matklad
Copy link
Member

matklad commented May 15, 2017

Oh I don't mind either way, we need to go from a Package to TomlManifest somewhere,

Can we reparse toml file into TomlManifest? It seems to me that Package -> Toml is actually the most laborious step here, and not the actual rewrite of stuff. And because TomlManifest is typed, you'd still get compile time errors.

What do you think about adding a test that rewrites a Cargo.toml and checks the resulting file literally? I understand that this should already be integration-tested by our packaging tests, but having a before-after style test would probably help with checking some edge cases, should they become apparent.

@alexcrichton
Copy link
Member Author

@matklad hm I'm not 100% sure what you're asking for actually, mind elaborating? I'm not sure what you're thinking is transitioning to what?

I wouldn't mind though, yeah, adding a byte-for-byte check somewhere of the manifests that are generated.

@matklad
Copy link
Member

matklad commented May 16, 2017

@matklad hm I'm not 100% sure what you're asking for actually, mind elaborating?

Sure!

We have an original Package and want to get a rewritten TomlManifest. This can be represented as a composition of two functions: to_toml_manifest: Package -> TomlManifest and rewrite: TomlManifest -> TomlManifest. In the current implementation in from_package_for_registry we do both steps simultaneously.

Looks like its to_toml_manifest part that is the most difficult! And in some sense it seems not really necessary: we've got a Package from TomlManifest originally, so why do we have to write the conversion in the opposite direction, if we can reuse the exact original manifest?

I think we can trivialize to_toml_manifest, if we don't try to recreate TomlManifest from Package literally, as we do now, but instead use the same code we already have for creating TomlManifests, namely, parsing a Cargo.toml file. In pseudorust:

fn to_registry_toml(pkg: Pacakge) -> String {
    let toml_text = fs::File(pkg.manifest_path).read_to_string();
    let toml_value: toml::Value = parse(toml_text);
    let toml: TomlManifest = serde::deserialize(toml_value); // here we are getting `TomlManifest` for free
    rewrite(toml).to_string()
}

Hm, now that I've spelled this out I see one more way to get TomlManifest for free: we can add a original_manifest field to the Package:

struct Package {
  original_manifest: TomlManifest
}

Anyways, this is only an implementation detail and it does not matter much if tested thoroughly :) If you think the current implementation is sufficiently best, let's add a byte-for-byte check and merge this.

@alexcrichton
Copy link
Member Author

Ok I think I like that strategy more. It still doesn't handle @cuviper's concerns about how it drops unknown keys from the manifest (e.g. ones added in future Cargos) but it makes me more comfortable by preserving the original keys and such.

@alexcrichton
Copy link
Member Author

@matklad ok I've updated with said strategy

bench: self.bench.clone(),
dependencies: map_deps(&self.dependencies),
dev_dependencies: map_deps(&self.dev_dependencies),
dev_dependencies2: map_deps(&self.dev_dependencies2),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we may normalize dev_dependencies2 into dev_dependencies at this step?

Even if we don't plan to drop support for the second variant, having only one form may be useful if, for example, someone greps ~/.cargo or otherwise interacts with crates without Cargo itself.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure yeah, makes sense to me, pushed up a variant of that!

tests/package.rs Outdated
exclude = ["*.txt"]
description = "foo"
license = "MIT"
"#));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make this a little bit less trivial and add [workspace] key and a path dependency, just to make sure that identity function won't pass this test.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heh good point, looks like I forgot to delete package.workspace as well, handled now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, does that mean that we don't test publising workspaced crates? The old code should have failed on publishing, shouldn't it?

This commit is an implementation of rewriting TOML manifests when we publish
them to the registry. The rationale for doing this is to provide a guarantee
that downloaded tarballs from crates.io can be built with `cargo build`
(literally). This in turn eases a number of other possible consumers of crates
from crates.io

* Vendored sources can now be more easily modified/checked as cargo build should
  work and they're standalone crates that suffice for `path` dependencies
* Tools like cargobomb/crater no longer need to edit the manifest and can
  instead perform regression testing on the literal tarballs they download
* Other systems such as packaging Rust code may be able to take advantage of
  this, but this is a less clear benefit.

Overall I'm hesitatnt about this, unfortunately. This is a silent translation
happening on *publish*, a rare operation, that's difficult to inspect before it
flies up to crates.io. I wrote a script to run this transformation over all
crates.io crates and found a surprisingly large number of discrepancies. The
transformation basically just downloaded all crates at all versions,
regenerated the manifest, and then tested if the two manifests were (in memory)
the same.

Unfortunately historical Cargo had a critical bug which I think made this
exercise not too useful. Cargo used to *not* recreate tarballs if one already
existed, which I believe led to situations such as:

1. `cargo publish`
2. Cargo generates an error about a dependency. This could be that there's a
   `version` not present in a `path` dependency, there could be a `git`
   dependency, etc.
3. Errors are fixed.
4. `cargo publish`
5. Publish is successful

In step 4 above historical Cargo *would not recreate the tarball*. This means
that the contents of the index (what was published) aren't guaranteed to match
with the tarball's `Cargo.toml`. When building from crates.io this is ok as the
index is the source of truth for dependency information, but it means that *any*
transformation to rewrite Cargo.toml is impossible to verify against all crates
on crates.io (due to historical bugs like these).

I strove to read as many errors as possible regardless, attempting to suss out
bugs in the implementation here. To further guard against surprises I've updated
the verification step of packaging to work "normally" in these sense that it's
not rewriting dependencies itself or changing summaries. I'm hoping that this
serves as a good last-ditch effort that what we're about to publish will indeed
build as expected when uploaded to crates.io

Overall I'm probably 70% confident in this change. I think it's necessary to
make progress, but I think there are going to be very painful bugs that arise
from this feature. I'm open to ideas to help weed out these bugs ahead of time!
I've done what I can but I fear it may not be entirely enough.

Closes rust-lang#4027
@matklad
Copy link
Member

matklad commented May 18, 2017

@alexcrichton looks great now!

@bors r+

@bors
Copy link
Contributor

bors commented May 18, 2017

📌 Commit 57db8bd has been approved by matklad

@bors
Copy link
Contributor

bors commented May 18, 2017

⌛ Testing commit 57db8bd with merge 3973598...

bors added a commit that referenced this pull request May 18, 2017
Rewrite Cargo.toml when packaging crates

This commit is an implementation of rewriting TOML manifests when we publish
them to the registry. The rationale for doing this is to provide a guarantee
that downloaded tarballs from crates.io can be built with `cargo build`
(literally). This in turn eases a number of other possible consumers of crates
from crates.io

* Vendored sources can now be more easily modified/checked as cargo build should
  work and they're standalone crates that suffice for `path` dependencies
* Tools like cargobomb/crater no longer need to edit the manifest and can
  instead perform regression testing on the literal tarballs they download
* Other systems such as packaging Rust code may be able to take advantage of
  this, but this is a less clear benefit.

Overall I'm hesitatnt about this, unfortunately. This is a silent translation
happening on *publish*, a rare operation, that's difficult to inspect before it
flies up to crates.io. I wrote a script to run this transformation over all
crates.io crates and found a surprisingly large number of discrepancies. The
transformation basically just downloaded all crates at all versions,
regenerated the manifest, and then tested if the two manifests were (in memory)
the same.

Unfortunately historical Cargo had a critical bug which I think made this
exercise not too useful. Cargo used to *not* recreate tarballs if one already
existed, which I believe led to situations such as:

1. `cargo publish`
2. Cargo generates an error about a dependency. This could be that there's a
   `version` not present in a `path` dependency, there could be a `git`
   dependency, etc.
3. Errors are fixed.
4. `cargo publish`
5. Publish is successful

In step 4 above historical Cargo *would not recreate the tarball*. This means
that the contents of the index (what was published) aren't guaranteed to match
with the tarball's `Cargo.toml`. When building from crates.io this is ok as the
index is the source of truth for dependency information, but it means that *any*
transformation to rewrite Cargo.toml is impossible to verify against all crates
on crates.io (due to historical bugs like these).

I strove to read as many errors as possible regardless, attempting to suss out
bugs in the implementation here. To further guard against surprises I've updated
the verification step of packaging to work "normally" in these sense that it's
not rewriting dependencies itself or changing summaries. I'm hoping that this
serves as a good last-ditch effort that what we're about to publish will indeed
build as expected when uploaded to crates.io

Overall I'm probably 70% confident in this change. I think it's necessary to
make progress, but I think there are going to be very painful bugs that arise
from this feature. I'm open to ideas to help weed out these bugs ahead of time!
I've done what I can but I fear it may not be entirely enough.

Closes #4027
@bors
Copy link
Contributor

bors commented May 18, 2017

☀️ Test successful - status-appveyor, status-travis
Approved by: matklad
Pushing 3973598 to master...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Rewrite Cargo.toml on publication to crates.io
7 participants