From f4706550b6df133915a293f5bc3144efeccb9007 Mon Sep 17 00:00:00 2001
From: John Ericson
Date: Wed, 12 Jul 2023 09:31:37 -0400
Subject: [PATCH] [RFC 0133] Git hashing and Git-hashing-based remote stores (#133)

* ipfs: Copy Template

* ipfs: Start drafting

* ipfs: Finish draft

* ipfs: Expand discussion of managing complexity

* ipfs: Fix typos

Thanks!

* ipfs: Fix more typos

Thanks!

* ipfs: FInish motivation on source distribution and archival

* ipfs: Rename now that we have number

* Apply suggestions from code review

Thanks!

Co-authored-by: Kevin Cox

* Fix typos

Thanks!

Co-authored-by: Adam Joseph <54836058+amjoseph-nixpkgs@users.noreply.github.com>

* 133: Add shepherd team!

Co-authored-by: Eelco Dolstra

* 133: Fix shepherds list

mjoseph -> amjoseph

* 133: Move non-`git` steps to future work

* 133: Move one more section out of future work

* 133: Move IPFS-specific motivation to future work too

* 133: Rename feature in light of changes

* 133: Rename RFC in light of changes

* 133: Discuss the downside of git's file system model being different

* Split future work, clean up Nix-agnostic stores section

* Fix numerious typos

Thanks, all of you!

Co-authored-by: Kevin Cox
Co-authored-by: Adam Joseph <54836058+amjoseph-nixpkgs@users.noreply.github.com>
Co-authored-by: Linus Heckemann

* Add RFC open PR date

* Be clearer about not supporting references to start

* Update rfcs/0133-git-hashing.md

Co-authored-by: Kevin Cox

* Rip out both RFC-scal Future Work sections

They are now in an `ipfs-2` branch in this repo.

* Remove "Build adoption through seamless interop"

That can go in a separate blog post.

* Apply suggestions from code review

Thank you both!!

Co-authored-by: Valentin Gagarin
Co-authored-by: Ryan Lahfa

* Slim down the layering section

The other stuff is already in flight, we don't need to talk about it so much here.
Co-authored-by: Valentin Gagarin

---------

Co-authored-by: Kevin Cox
Co-authored-by: Adam Joseph <54836058+amjoseph-nixpkgs@users.noreply.github.com>
Co-authored-by: Eelco Dolstra
Co-authored-by: Linus Heckemann
Co-authored-by: Valentin Gagarin
Co-authored-by: Ryan Lahfa
---
 rfcs/0133-git-hashing.md | 164 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 164 insertions(+)
 create mode 100644 rfcs/0133-git-hashing.md

diff --git a/rfcs/0133-git-hashing.md b/rfcs/0133-git-hashing.md
new file mode 100644
index 000000000..81eb9f192
--- /dev/null
+++ b/rfcs/0133-git-hashing.md
@@ -0,0 +1,164 @@
+---
+feature: git-hashing
+start-date: 2022-08-27
+author: John Ericson (@Ericson2314) on behalf of [Obsidian Systems](https://obsidian.systems)
+co-authors: (find a buddy later to help out with the RFC)
+shepherd-team: edolstra, kevincox, gador, @amjoseph-nixpkgs
+shepherd-leader: amjoseph-nixpkgs
+related-issues: (will contain links to implementation PRs)
+---
+
+# Summary
+[summary]: #summary
+
+Integrate Git hashing with Nix.
+
+Nix should support content-addressed store objects using Git blob + tree hashing, and Nix-unaware remote stores that serve Git objects.
+
+This follows the work done and described in https://github.com/obsidiansystems/ipfs-nix-guide/ .
+
+# Motivation
+[motivation]: #motivation
+
+## Binary distribution
+
+Currently, distributing Nix binaries takes a lot of bandwidth and storage.
+This is a barrier to being a Nix user in areas with slower internet, which includes the vast majority of the world's population at this time.
+This is also a barrier to users running their own caches.
+
+Content-addressing opens up a *huge* design space of solutions to get around such problems.
+
+The first steps proposed below do *not* tackle this problem directly, but they lay the groundwork for future experiments in this direction.
+
+## Source distribution and archival
+
+Source code used by Nix expressions frequently goes off-line.
+It would be beneficial if there were some resistance to this form of bitrot.
+The Software Heritage archive stores much of the source code that Nix expressions use. They would be a natural partner in this effort.
+
+Unfortunately, as https://www.tweag.io/blog/2020-06-18-software-heritage/ describes at the end, a major challenge is the way Nix content-addresses software.
+First of all, Nix hashes sources in bespoke ways that no other project will adopt.
+Second of all, hashing tarballs instead of the underlying files makes the hash depend on non-normative details (compression, odd permissions, etc.).
+
+We should natively support Git file hashing, which is supported both by Git repos and Software Heritage.
+This will eliminate both issues.
+
+Overall, we are building out a uniform way to work with source code, regardless of its origins or the exact tools involved.
+
+# Detailed design
+[design]: #detailed-design
+
+Each item can be done separately, provided the items it depends on are also done.
+These are the items we wish to commit to at this time.
+(The goals mentioned under [future work](#future-work) are, in a separate document, also broken down into a dependency graph of smaller steps.)
+
+## Git file hashing
+
+- **Purpose**: Source distribution and archival
+
+In addition to the various forms of content-addressing Nix supports today ("text", "fixed" with either "flat" or "nar" serialization of file system objects), Nix should support Git hashing.
+This support entails two basic things:
+
+ - Content addresses are used to compute store paths.
+ - Content addresses are used to verify store object integrity.
+
+Git hashing would not (in this first proposed version) support references, since references in Nix's sense are not part of Git's data model.
+This is OK for now; encoding references is not needed for the intended initial use-case of exchanging source code.
+
+## Git file hashing for `builtins.fetch*`
+
+- **Purpose**: Source distribution and archival
+- **Depends on**: Git file hashing
+
+The built-in fetchers can also be made to work with Git file hashing just as they support the other types.
+In addition, Git repo fetching can leverage this better than the other formats since the data in Git repos is already content-addressed in this way.
+
+## Nix-agnostic content-addressing "stores"
+
+- **Purpose**: All distribution
+
+We want to be able to substitute from an arbitrary store (in the general, non-Nix sense) of content-addressed objects.
+For the purpose of this RFC, that means querying objects by Git hash, and being able to trust the results because we can verify them against the Git hash.
+
+In the implementation, we could accomplish this in a variety of ways.
+
+- On one extreme, we could have a `ContentAddressedSubstitutor` abstract interface completely separate from Nix's `Store` interface.
+
+- On the other extreme, we could generalize `Store` itself to allow taking content addresses or store paths as references.
+
+Exactly how this shakes out is to be determined post-RFC, but it would be nice to use Nix-agnostic persistent methods with `--store` and `--substituters`.
+
+If we do go the route of modifying the `Store` class, note that these things will need to happen:
+
+ - Many store interface methods that today take store paths will need to also accept name and content address pairs.
+
+   For stores that are purpose-built for Nix, like the ones we support today, all addressing can be done with store paths, so the current interface is fine.
+   But for Nix-agnostic stores, store paths are rather useless as a key type because Nix-agnostic tools don't know about them.
+   Those stores can, however, understand content addresses.
+   And from such a name + content address, we can always produce a store path again, so there is no loss of functionality with existing stores.
+
+- Relax `ValidPathInfo` to merely require that *either* the pair of `NarHash` and `NarSize` or just `CA` alone be defined.
+
+  As described in the first step, currently `NarHash` and `NarSize` are the *normative* fields which are used to verify a store object.
+  But if the store object is content-addressed, we don't need these, because the content address (`CA` field) will also suffice, all by itself.
+
+  Existing Nix store types are still required to contain a `NarHash` and `NarSize`, which is good for backwards compatibility and doesn't come with a cost.
+  Only new Nix-agnostic store types would take advantage of these new, relaxed rules.
+
+# Examples and Interactions
+[examples-and-interactions]: #examples-and-interactions
+
+We encourage anyone interested to check our tutorial in https://github.com/obsidiansystems/ipfs-nix-guide/ which demonstrates the above functionality.
+Note that, at the time of writing, this guide uses our original 2020 fork of Nix.
+
+# Drawbacks
+[drawbacks]: #drawbacks
+
+## Complexity
+
+The main cost is added complexity in the store layer.
+For a few reasons we think this is not so bad.
+
+Most important is the division of the work into a dependency graph of steps.
+This allows us to slowly try out things like IPFS that leverage Git hashing, and not commit to more change than we want to up front.
+
+Even if we do end up adopting everything though, we think for the following two reasons the complexity can still be kept manageable:
+
+1. Per the abstract vs concrete model of the Nix store in https://github.com/NixOS/nix/pull/6877, everything we are doing is simply fleshing out alternative interpretations of the abstract model.
+   This is the sense in which we are, per the Scheme mantra, "removing the weaknesses and restrictions that make additional features appear necessary":
+   Instead of extending the model with new features, we are relaxing concrete model assumptions (e.g. references are always opaque store paths) while keeping the abstract model the same.
+
+2. We also support plans to decouple the layers of Nix further, and update our educational and marketing material to reflect it.
+   Layering will "divide and conquer" the project so the interfaces between each layer are still rigorously enforced, preventing a combinatorial explosion in complexity.
+   That frees up "complexity budget" for projects like this.
+
+## Git and Nix's file system data models do not entirely coincide
+
+Nix puts the permission info of a file (executable bit for now) with that file, whereas Git puts it with the name and hash in the directory.
+The practical effect of this discrepancy is that a root file (as opposed to directory) in Nix has permission info, but does not in Git.
+
+If we are trying to convert existing Nix data into Git, this is a problem.
+Assuming we treat "no permission bits" as meaning "non-executable", we will have a partial conversion that will fail on executable files without a parent directory.
+Tricks like always wrapping everything in a directory get around this, but then we have to be careful the directory is exactly as expected when "unwrapping" in the other direction.
+
+For now, we only focus on ingesting data *from* Git *to* Nix, and this side-steps the issue.
+That mapping is total, i.e. all Git data can be mapped, and injective, i.e. each piece of Git data has a unique Nix data representative (though not surjective, i.e. not all Nix data can be represented as a piece of Git data), and so there is no problem for now.
+
+# Alternatives
+[alternatives]: #alternatives
+
+The dependency graph of steps can be sliced to save some for future work.
+For now they are all written together, but during the RFC meetings we will decide which steps (if any) to ratify now, and which steps to save for later.
+
+# Unresolved questions
+[unresolved]: #unresolved-questions
+
+None at this time.
+
+# Future work
+[future]: #future-work
+
+- Integrate with outside content-addressing storage/transmission like
+
+  - The Software Heritage archive
+
+  - IPFS
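
To make the "Git file hashing" item in the detailed design more concrete, here is a minimal Python sketch of how Git derives blob and tree identifiers. It assumes Git's SHA-1 object format and is not taken from the Nix code base; the names `git_object_id`, `blob_id`, and `tree_id` are hypothetical helpers introduced only for illustration.

```python
import hashlib


def git_object_id(kind: str, payload: bytes) -> str:
    """Hash a Git object: SHA-1 over '<kind> <size>', a NUL byte, then the payload."""
    header = f"{kind} {len(payload)}".encode() + b"\x00"
    return hashlib.sha1(header + payload).hexdigest()


def blob_id(content: bytes) -> str:
    """Identifier of a file's contents. Note that no permission bits are involved."""
    return git_object_id("blob", content)


def tree_id(entries) -> str:
    """Identifier of a directory.

    `entries` is an iterable of (mode, name, hex_object_id) tuples, e.g.
    ("100644", "README", blob_id(...)) for a regular file,
    ("100755", "run.sh", blob_id(...)) for an executable file,
    ("40000", "src", tree_id(...)) for a subdirectory.
    """
    def sort_key(entry):
        mode, name, _ = entry
        raw = name.encode()
        # Git orders directory names as if they ended with a slash.
        return raw + b"/" if mode == "40000" else raw

    payload = b"".join(
        f"{mode} {name}".encode() + b"\x00" + bytes.fromhex(oid)
        for mode, name, oid in sorted(entries, key=sort_key)
    )
    return git_object_id("tree", payload)
```

For example, `blob_id(b"hello\n")` yields `ce013625030ba8dba906f756967f9e9ca394464a`, the same value `git hash-object` reports for that content. The executable bit appears only in the `mode` field of a tree entry, never in the blob itself, which is the data-model mismatch described under Drawbacks: a lone root file has no place in Git's model to record its permissions.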