Add support local mirrors of registries, take 2 #2857
@alexcrichton: no appropriate reviewer found, use r? to override
This is a rebasing of #2361 with directory support added in. I'll continue to follow up with more comments in a bit (hit the "make a PR" button a bit too soon...)
Ok, cc @rust-lang/tools, a significant addition to Cargo with lots of room to grow as well. This is likely the start of Cargo supporting multiple registries. A next major work item here is to actually make it possible to bring your own mirror registry of crates.io online. That is, write the server which mirrors crates.io and otherwise ships crates to you. Most of that shouldn't be too hard, but it'll be hard to get it nice and ergonomic the first time around I believe. cc @luser, this is the support in Cargo necessary for vendoring dependencies in Gecko. The workflow we imagine is:
350a1b3 to 46a239f
So just so I'm clear, the local registry directory layout should look like:
And the plan is to nail this down as ABI so people can depend on this layout to remain constant?
@cardoe indeed! Actually that reminds me that I forgot to write documentation on this in I've previously written |
This all sounds great. The directory source should meet our needs exactly.
This seems pretty reasonable. We'll have to nail down (and document) the process for making changes to vendored libraries. I know you or @wycats mentioned a potential |
@alexcrichton I think I'm starting to understand this better :). If the section is called |
It wasn't clear to me, but looking at some of the test cases it doesn't look like the I'd also like to see |
Yeah definitely, I'll try to write up documentation for all this, especially on the expected workflows. We can also chat more directly to make sure we're comfortable with it all.
Unfortunately, no. To replace a source with another the checksum property (whether or not the source has checksums) must be the same. Registries have checksums, but git repositories and path sources do not. Vendored sources do, however.
Yeah I was thinking we may want to do something like that. Of course we could also just say that if it starts with a bunch of unknown hex characters it's a sha256 hash, so it's not necessarily super pressing.
☔ The latest upstream changes (presumably #2858) made this pull request unmergeable. Please resolve the merge conflicts.
@alexcrichton Could those checksums be generated on the fly? If the checksum is generated by |
@alexcrichton Saying "if it starts with a bunch of unknown hex characters it's a sha256 hash" worries me.
Please be explicit here in the metadata format to allow for future changes without any guessing. The hash types used here will need to change in the future, so it would be better to avoid having to keep around a separate legacy "if there's no identifier yolo sha256" code path. See pip for a good example of package verification.
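One way to read the request above is a checksum string that always names its algorithm, so nothing has to be guessed from the shape of the digest. The sketch below is only an illustration under that assumption: the `algorithm:hexdigest` form, the allow-list, and the function names are hypothetical and not something Cargo implements.

```python
import hashlib

# Hypothetical allow-list; a real format would pin down its own set.
SUPPORTED_ALGORITHMS = {"sha256", "sha512"}


def parse_checksum(tagged: str) -> tuple[str, str]:
    """Split 'algorithm:hexdigest' and reject untagged or unknown entries."""
    algorithm, separator, digest = tagged.partition(":")
    if not separator or algorithm not in SUPPORTED_ALGORITHMS:
        raise ValueError(f"checksum needs an explicit, supported algorithm: {tagged!r}")
    return algorithm, digest.lower()


def verify(data: bytes, tagged: str) -> bool:
    """Hash the data with the named algorithm and compare against the digest."""
    algorithm, expected = parse_checksum(tagged)
    return hashlib.new(algorithm, data).hexdigest() == expected
```

With a tagged form like this, a bare hex string is an error rather than an implicit sha256, so there is no legacy code path to carry forward when the hash type changes.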
This looks excellent, and should work well for Debian packaging. Regarding the local-registry index, could Cargo support reading the index from multiple files, in an index.d directory, if one exists? That would allow each crate to install its associated index entry. Without that, Debian would have to have a trigger that updates the index for every installed crate package. Looking at the index format, it comes close to supporting this, but still requires a list of versions of each crate within a single file, which would make it harder to package each crate version separately. If each crate version could install a separate file, then packages of crate versions wouldn't need to run any code at installation time.
@alexcrichton Moreover, for this NixOS case, the idea is that every source, not just crates.io, would need to be mirrored (nix would fetch all git repos, etc, so that Cargo would work without network access). So being able to mirror any source (not just registries) is crucial for our use cases. This is technically unrelated to being able to mirror a source with any sort of source, but as a practical matter, it would be much easier to replace a git source with its local clone than have to create a single-repo directory registry. CC NixOS/nixpkgs#11144 @wycats ^ basically what we were talking about re nix.
@joshtriplett Wouldn't the directory-registry work better for distro use? That's what I was leaning toward, then no index management is needed. The -devel packages would just drop their source into that shared root.
@cuviper That makes more sense to me for a distro use case. The only thing this runs into trouble with is if people expect to carry different patches for the same build dependency across different packages which seems to be wrong in my mind.
@bryteise I would expect those people to hash out their differences, just like they must for any other common dependency. IOW multiple sources shouldn't be dropping the same crate in that path, just one shared source that has to work for everyone.
No, I described this in "Directory registries" in the PR description.
Yes, this is intended to be extensible for things like git repos as well, although it'd have to be specified once per git repo as well.
Could you elaborate what you mean by this? I'm unfortunately not sure what index.d is :(. Right now it's kinda nice sharing all the index handling logic between the remote and local registries, and it'd be a bit of a shame to lose but perhaps not the worst!
Usually "foo.d" is a directory of individual files to augment the main one. Like how there's |
@alexcrichton you mean this paragraph?
I don't see why those files can't just be hashed on the fly. If the lockfile doesn't already exist, we are already assuming that the mirror reflects the original---precomputed hashes do not get around the fact that this is a matter of trust. If the lockfile does exist, the on-the-fly hashing is verified against that (Cargo computes the whole Merkle DAG). I'm not saying we shouldn't have a directory repository---all that recursive hashing is slower---but I don't see a security problem.
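To make the cost of the on-the-fly idea concrete, here is a rough sketch of hashing an unpacked crate directory at load time. It is an assumption-laden illustration, not Cargo's behavior: the function names and the way per-file digests are folded into a single value are invented for this example.

```python
import hashlib
from pathlib import Path


def hash_tree(root: str) -> dict[str, str]:
    """Map each file's path (relative to root) to its sha256 hex digest."""
    base = Path(root)
    digests = {}
    for path in sorted(base.rglob("*")):
        if path.is_file():
            relative = path.relative_to(base).as_posix()
            digests[relative] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def tree_digest(root: str) -> str:
    """Fold the per-file digests into one digest for the whole directory."""
    combined = hashlib.sha256()
    for relative, digest in sorted(hash_tree(root).items()):
        # Hypothetical encoding: path, NUL separator, digest, newline.
        combined.update(f"{relative}\0{digest}\n".encode())
    return combined.hexdigest()
```

Every file has to be read in full on each run, which is the slowness mentioned in the comment; a precomputed checksum file trades that cost for a metadata file that has to be kept in sync.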
@cuviper @alexcrichton The compressed .crate files seemed nice to save space, but directory registries would work fine in distro packages too. The one downside would be the inability to match hashes directly with upstream, relying on the per-file hashes that (as I understand it) upstream doesn't directly provide. However, that property applies to every other package in the distribution, too, so it doesn't really matter. The index.d concept was that each crate package could drop its portion of the index into a directory, rather than needing to have a single file that multiple crate packages would need to update. But if directory registries don't need that index at all, then that seems like the preferred alternative.
@Ericson2314 It isn't a security problem; it's more that it stops casual "oh, I'll just patch the package" local hacks, which break the whole assumption of reproducible builds where version X of crate C always refers to the same crate contents.
@joshtriplett Sure, it's not a real security problem. I'm just trying to get at what's worse about using the existing sources for mirrors. The only thing I could come up with is the slowness of hashing more things.
Distros should be allowed to patch these things though!
@cuviper Absolutely. But they shouldn't patch version X of crate C and still call it version X of crate C; at that point, it's, for instance, Debian package version X-2 (X-3, X-4, ...) of crate C.
This commit changes how lock files are encoded by adding checksums for each package in the lockfile to the `[metadata]` section. The previous commit implemented the ability to redirect sources, but the core assumption there was that a package coming from two different locations was always the same. An inevitable case, however, is that a source gets corrupted or, worse, ships a modified version of a crate to introduce instability between two "mirrors". The purpose of adding checksums will be to resolve this discrepancy.

Each crate coming from crates.io will now record its sha256 checksum in the lock file. When a lock file already exists, the new checksum for a crate will be checked against it, and if they differ compilation will be aborted. Currently only registry crates will have sha256 checksums listed; all other sources do not have checksums at this time.

The astute may notice that if the lock file format is changing, then a lock file generated by a newer Cargo might be mangled by an older Cargo. In anticipation of this, however, all Cargo versions published support a `[metadata]` section of the lock file which is transparently carried forward if encountered. This means that older Cargos compiling with a newer lock file will not verify checksums in the lock file, but they will carry forward the checksum information and prevent it from being removed.

There are, however, a few situations where problems may still arise:

1. If an older Cargo takes a newer lockfile (with checksums) and updates it with a modified `Cargo.toml` (e.g. a package was added, removed, or updated), then the `[metadata]` section will not be updated appropriately. This modification would require a newer Cargo to come in and update the checksums for such a modification.
2. Today Cargo can only calculate checksums for registry sources, but we may eventually want to support other sources like git (or just straight-up path sources). If future Cargo implements support for this sort of checksum, then it's the same problem as above where older Cargos will not know how to keep the checksum in sync.
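The check described in this commit message can be pictured roughly as follows. This is a minimal sketch, assuming the recorded checksums are available as a plain mapping from a package identifier to a hex digest; the mapping shape, the identifier strings, and the names below are hypothetical and do not reflect Cargo's internal types or the exact `[metadata]` key format.

```python
class ChecksumMismatch(Exception):
    """Raised when a crate's checksum differs from the one in the lock file."""


def check_against_lock(locked: dict[str, str], package: str, fresh: str) -> None:
    """Compare a freshly computed checksum with the recorded one, if any."""
    recorded = locked.get(package)
    if recorded is None:
        # Assumption: nothing recorded yet (e.g. the lock file predates
        # checksums), so the fresh value is simply written down.
        locked[package] = fresh
    elif recorded != fresh:
        # The same package from two locations must be byte-identical.
        raise ChecksumMismatch(
            f"checksum for {package} changed: lock file has {recorded}, got {fresh}"
        )
```

Raising instead of silently overwriting the recorded value is what turns a corrupt or modified mirror into a hard build failure rather than a non-reproducible build.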
Add an abstraction over which the index can be updated and downloads can be made. This is currently implemented for "remote" registries (e.g. crates.io), but soon there will be one for "local" registries as well.
This flavor of registry is intended to behave very similarly to the standard remote registry, except everything is contained locally on the filesystem instead. There are a few components to this new flavor of registry:

1. The registry itself is rooted at a particular directory, owning all structure beneath it.
2. There is an `index` folder with the same structure as the crates.io index describing the local registry (e.g. contents, versions, checksums, etc).
3. Inside the root will also be a list of `.crate` files which correspond to those described in the index. All crates must be of the form `name-version.crate` and be the same `.crate` files from crates.io itself.

This support can currently be used via the previous implementation of source overrides with the new type:

```toml
[source.crates-io]
replace-with = 'my-awesome-registry'

[source.my-awesome-registry]
local-registry = 'path/to/registry'
```

I will soon follow up with a tool which can be used to manage these local registries externally.
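As a rough illustration of the layout rules above, the following sketch checks that a registry root has an `index` directory and a `name-version.crate` file for each package a build expects. The function name and the idea of passing in `(name, version)` pairs are assumptions made for this example; the sketch does not parse the index itself.

```python
from pathlib import Path


def missing_from_registry(root: str, packages: list[tuple[str, str]]) -> list[str]:
    """Return the entries a local registry root is missing for these packages."""
    registry = Path(root)
    problems = []
    if not (registry / "index").is_dir():
        problems.append("index")
    for name, version in packages:
        # Crate files sit directly in the root as name-version.crate.
        crate_file = f"{name}-{version}.crate"
        if not (registry / crate_file).is_file():
            problems.append(crate_file)
    return problems
```

An empty result only says the expected files are present; whether each `.crate` file matches the checksum listed in the index is a separate check.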
This flavor of source is intended to behave like a local registry except that its contents are unpacked rather than zipped up in `.crate` form. Like with local registries, the only way to use this currently is via the `.cargo/config`-based source replacement, and primarily only to replace crates.io or other registries at the moment.

A directory source is simply a directory which has many `.crate` files unpacked inside of it. The directory is not recursively traversed for changes, but rather it is just required that all elements in the directory are themselves directories of packages. This format is more suitable for checking into source trees, and it still provides guarantees around preventing modification of the original source from the upstream copy.

Each directory in the directory source is required to have a `.cargo-checksum.json` file indicating the checksum it *would* have had if the crate had come from the original source as well as all of the sha256 checksums of all the files in the repo. It is intended that directory sources are assembled from a separately shipped subcommand (e.g. `cargo vendor` or `cargo local-registry`), so these checksum files don't have to be managed manually.

Modification of a directory source is not the intended purpose, and if a modification is detected then the user is nudged towards solutions like `[replace]` which are intended for overriding other sources and processing local modifications.
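The per-file check described above might look something like the sketch below. It assumes, for illustration only, that `.cargo-checksum.json` holds a `"package"` entry (the checksum the crate would have had as a `.crate` file) and a `"files"` mapping from relative path to sha256 hex digest; those key names and the function name are assumptions of this example rather than a documented contract.

```python
import hashlib
import json
from pathlib import Path


def modified_files(package_dir: str) -> list[str]:
    """List files whose contents no longer match the recorded sha256."""
    package = Path(package_dir)
    metadata = json.loads((package / ".cargo-checksum.json").read_text())
    changed = []
    # Assumed shape: {"package": "<sha256>", "files": {"src/lib.rs": "<sha256>"}}
    for relative, expected in sorted(metadata["files"].items()):
        path = package / relative
        if not path.is_file() or hashlib.sha256(path.read_bytes()).hexdigest() != expected:
            changed.append(relative)
    return changed
```

A non-empty result is the point at which a user would be steered towards `[replace]` instead of editing the vendored copy in place. Note that this only catches accidental edits: anyone who changes a file can also rewrite the recorded digest.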
2d15570 to 2eda182
Yeah you're right in that we load up checksums for all the crates in a directory registry. Not doing so would require an index similar to the local registry index. The intent of directory sources is vendoring, not distros, and in the case of vendoring you're going to load all the crates anyway to build them so the overhead shouldn't be much. Also yeah, this isn't preventing any sort of "malicious activity" wrt directories. If you change a file you can change the checksum in the json metadata. The intent is to prevent accidental updates, not prevent malicious modifications. The core of the source replacement mechanism is that you replace a crate with the exact same code, just from a different location. If directory sources had no checksums at all then they couldn't provide that guarantee, but this allows them to at least provide a good enough guarantee along those lines for our purposes.
2eda182 to 772c15c
772c15c to 63ac9e1
Ok, @brson I believe I should have addressed all your comments and I've also pushed a commit containing documentation for this to go on doc.crates.io
@bors r+
📌 Commit 63ac9e1 has been approved by |
Add support local mirrors of registries, take 2

This series of commits culminates in first class support in Cargo for local mirrors of registries. This is implemented through a number of other more generic mechanisms, and extra support was added along the way. The highlights of this PR, however, are:

Source redirection

New `.cargo/config` keys have been added to enable *replacing one source with another*. This functionality is intended to be used for mirrors of the main registry or otherwise one to one source correspondences. The support looks like:

```toml
[source.crates-io]
replace-with = 'my-awesome-registry'

[source.my-awesome-registry]
registry = 'https://github.com/my-awesome/registry-index'
```

This configuration means that instead of using `crates-io` (e.g. `https://github.com/rust-lang/crates.io-index`), Cargo will query the `my-awesome-registry` source instead (configured to a different index here). This alternate source **must be the exact same as the crates.io index**. Cargo assumes that replacement sources are exact 1:1 mirrors in this respect, and the following support is designed around that assumption.

When generating a lock file for a crate using a replacement registry, the *original registry* will be encoded into the lock file. For example in the configuration above, all lock files will still mention crates.io as the registry that packages originated from. This semantically represents how crates.io is the source of truth for all crates, and this is upheld because all replacements have a 1:1 correspondence.

Overall, this means that no matter what replacement source you're working with, you can ship your lock file to anyone else and you'll all still have verifiably reproducible builds!

Adding sha256 checksums to the lock file

With the above support for custom registries, it's now possible for a project to be downloading crates from any number of sources. One of Cargo's core goals is reproducible builds, and with all these new sources of information it may be easy for a few situations to arise:

1. A local replacement of crates.io could be corrupt
2. A local replacement of crates.io could have made subtle changes to crates

In both of these cases, Cargo would today simply give non-reproducible builds. To help assuage these concerns, Cargo will now track the sha256 checksum of all crates from registries in the lock file. Whenever a `Cargo.lock` is generated from now on it will contain a `[metadata]` section which lists the sha256 checksum of all crates in the lock file (or `<none>` if the sha256 checksum isn't known).

Cargo already checks registry checksums against what's actually downloaded, and Cargo will now verify between iterations of the lock file that checksums remain the same as well. This means that if a local replacement registry is **not** in a 1:1 correspondence with crates.io, the lock file will prevent the build from progressing until the discrepancy is resolved.

Local Registries

In addition to the support above, there is now a new kind of source in Cargo, a "local registry", which is intended to be a subset of the crates.io ecosystem purposed for a local build for any particular project here or there. The way to enable this looks like:

```toml
[source.crates-io]
replace-with = 'my-awesome-registry'

[source.my-awesome-registry]
local-registry = 'path/to/my/local/registry'
```

This local registry is expected to have two components:

1. A directory called `index` which matches the same structure as the crates.io index. The `config.json` file is not required here.
2. Inside the registry directory are any number of `.crate` files (downloaded from crates.io). Each crate file has the name `<package>-<version>.crate`.

This local registry must currently be managed manually, but I plan on publishing and maintaining a Cargo subcommand to manage a local registry. It will have options to do things like:

1. Sync a local registry with a `Cargo.lock`
2. Add a registry package to a local registry
3. Remove a package from a local registry

Directory registries

In addition to local registries, Cargo also supports a "directory source" like so:

```toml
[source.crates-io]
replace-with = 'my-awesome-registry'

[source.my-awesome-registry]
directory = 'path/to/some/sources'
```

A directory source is similar to a local registry above, except that all the crates are unpacked and visible as vendored source. This format is suitable for checking into source trees, like Gecko's.

Unlike local registries above we don't have a tarball to verify the crates.io checksum with, but each vendored dependency has metadata containing what it *would* have been. To further prevent modifications by accident, the metadata contains the checksum of each file which should prevent accidental local modifications and steer towards `[replace]` as the mechanism to edit dependencies if necessary.

What's all this for?

This is quite a bit of new features! What's all this meant to do? Some example scenarios that this is envisioned to solve are:

1. Supporting mirrors for crates.io in a first class fashion. Once we have the ability to spin up your own local registry, it should be easy to locally select a new mirror.
2. Supporting round-robin mirrors, this provides an easy vector for configuration of "instead of crates.io hit the first source in this list that works"
3. Build environments where network access is not an option. Preparing a local registry ahead-of-time (from a known good lock file) will be a vector to ensure that all Rust dependencies are locally available.
   * Note this is intended to include use cases like Debian and Gecko

What's next?

Even with the new goodies here, there's some more vectors through which this can be expanded:

* Support for running your own mirror of crates.io needs to be implemented to be "easy to do". There should for example be a `cargo install foo` available to have everything "Just Work".
* Replacing a source with a list of sources (attempted in round robin fashion) needs to be implemented
* Eventually this support will be extended to the `Cargo.toml` file itself. For example:
  * packages should be downloadable from multiple registries
  * replacement sources should be encodable into `Cargo.toml` (note that these replacements, unlike the ones above, would be encoded into `Cargo.lock`)
  * adding multiple mirrors to a `Cargo.toml` should be supported
* Implementing the subcommand above to manage local registries needs to happen (I will attend to this shortly)
☀️ Test successful - cargo-cross-linux, cargo-linux-32, cargo-linux-64, cargo-mac-32, cargo-mac-64, cargo-win-gnu-32, cargo-win-gnu-64, cargo-win-msvc-32, cargo-win-msvc-64
Several updates to token/index handling.

This attempts to tighten up the usage of token/index handling, to prevent accidental leakage of the crates.io token.

* Make `registry.index` config a hard error. This was deprecated 4 years ago in #2857, and removing it helps simplify things.
* Don't allow both `--index` and `--registry` to be specified at the same time. Otherwise `--index` was being silently ignored.
* `registry.token` is not allowed to be used with the `--index` flag. The intent here is to avoid possibly leaking a crates.io token to another host.
* Added a warning if source replacement is used and the token is loaded from `registry.token`.

Closes #6545