
Fix adding paths to the store (try 3) #1754

Closed · wants to merge 2 commits

Conversation

@bgamari (Contributor) commented Dec 22, 2017

This is a rebased version of the patch originally authored by @Mathnerd314 and proposed for merge in #619. It refactors the addToStore logic to stream data from the file system, avoiding the need to read potentially large files into memory. Note that this patch, unlike the version initially put forth in #619, incurs no dependency on Boost.

So far this has only been build-tested; however, I would like review of the overall idea.
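For context, the heart of the streaming approach looks roughly like this (a minimal sketch against Nix's internal `dumpPath`/`HashSink` interfaces; exact signatures may vary between versions):

```cpp
#include "archive.hh"   // dumpPath: NAR-serializes a filesystem path into a Sink
#include "hash.hh"      // HashSink: hashes whatever bytes are fed into it

// Hash a path's NAR serialization without ever holding the whole
// file in memory: dumpPath streams chunks straight into the HashSink.
Hash hashPathStreaming(const Path & srcPath, PathFilter & filter)
{
    HashSink sink(htSHA256);
    dumpPath(srcPath, sink, filter);
    return sink.finish().first;   // finish() yields (hash, bytes consumed)
}
```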

copyPath(srcPath, dstPath, filter, recursive);

canonicalisePathMetaData(dstPath, -1);           // normalize permissions, timestamps, ownership
HashResult hash = hashPath(htSHA256, dstPath);   // note: a second full read of the path
A reviewer (Member) asked:

So we copy then hash? How does this differ from the hashPath call above?

@bgamari (author) replied:

I suspect this hashPath is redundant. Assuming that canonicalisePathMetaData doesn't affect the hash, it seems like we should be able to use the hash h computed above.

@bgamari (author) added:

@Mathnerd314, do you recall why this second hash was needed?

A contributor replied:

See the comment over here. The first hash computes the output path, the second is the SHA256 of the NAR serialization which is used for validity checking. I guess the code could re-use the hash in the case of recursive SHA256, but hashing isn't that expensive.
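Put another way (an illustrative predicate, not code from this PR): the two hashes only coincide, and reuse is only possible, for the recursive SHA-256 ingestion method:

```cpp
// The first hash (any type, flat or recursive) fixes the store path; the
// validity record always wants the SHA-256 of the NAR serialization.
// Reusing the first hash is only sound when both cover the same bytes:
bool canReuseFirstHash(bool recursive, HashType hashType)
{
    return recursive && hashType == htSHA256;
}
```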

@@ -412,9 +412,16 @@ Path RemoteStore::addToStore(const string & name, const Path & _srcPath,

auto conn(connections->get());

if (GET_PROTOCOL_MINOR(conn->daemonVersion) < 0x15) {
A reviewer (Member) commented:

That's a pretty tight compatibility requirement :(

@bgamari (author) replied:

Right, I honestly don't know whether we could improve the compatibility story here. Ideas are certainly welcome.

@dtzWill (Member) commented Mar 20, 2018

Recent commits (based on an earlier version of this, I think?) should alleviate much of this; can this be closed, or is it still relevant?

@bgamari (author) commented Mar 26, 2018

I will happily find time to look into the failures; however, I would first like some feedback on the acceptability of the general approach.

@copumpkin (Member) commented:
In general it looks fine to me, but I think only @shlevy and @edolstra are really qualified to evaluate it deeply.

@shlevy (Member) commented Mar 26, 2018

@bgamari In a general sense I'm very much in favor of this, but can you provide a high-level view of what this is doing on top of the recent streaming improvements already on master?

@edolstra (Member) commented Mar 26, 2018

@shlevy Those improve copying paths between stores (including substitution), they don't improve addToStore from a (possibly filtered) path.

If I read this PR correctly, it reads the path twice: first to compute the hash / store path, and second to copy it to the destination store path. This has a race: if the contents of the source path change in between, we end up with a corrupt store path. (Note that this is a security bug.)

In general, we have a tension here between performance and memory usage. There are several alternatives:

  1. The current implementation: read once into memory, hash, write to the store if the destination store path is not already valid. O(n) memory.

  2. Hash first, then copy if the destination store path is not already valid (this PR). This requires reading and hashing twice, which is obviously slower.

  3. Copy to a temporary location in the store, then rename() if the destination store path is not already valid. This is single-pass but does unnecessary write I/O in the common case where the path is already valid. (A sketch of this follows the next paragraph.)

  4. Hybrid: do 1) if the path is small, otherwise do 2) or 3).

However, maybe this should be a "will not fix" since Nix is not really intended for use cases where you have enormous path dependencies, because it will always be slow. (Even in the best case, it needs to read/hash the entire path on every evaluation.) If you have a giant dependency, you're better off using fetchurl or something similar.
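For concreteness, alternative 3 might look something like this (a sketch only; `createTempDirInStore` and `copyAndHashPath` are illustrative names, not Nix's actual API):

```cpp
#include <cstdio>   // rename()

// Alternative 3, sketched: stream the source into a temporary location
// inside the store, hashing the NAR serialization during the same pass,
// then rename into place only if the destination isn't already valid.
Path addToStoreViaTemp(const Path & src, const string & name)
{
    Path tmp = createTempDirInStore();         // hypothetical helper
    Hash narHash = copyAndHashPath(src, tmp);  // one read: write + hash together
    Path dst = makeFixedOutputPath(true, narHash, name);
    if (isValidPath(dst))
        deletePath(tmp);  // common case: the data was written for nothing
    else
        rename(tmp.c_str(), dst.c_str());  // cheap and atomic within one filesystem
    return dst;
}
```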

@shlevy (Member) commented Mar 26, 2018

Hmm, IMO a hybrid approach is a good idea, especially since we already have the "large path" warning, so we already have a notion of "small". The problem with fetchurl or similar is the need to know the hash in advance, and it doesn't have the nice relative-path resolution; without those issues I think it'd be a good solution here.

@edolstra (Member) replied:
Well, you have to specify the hash to prevent having to read the file on every evaluation. Though I suppose builtins.fetchurl could have a store path cache similar to builtins.fetchGit (keyed on e.g. inode + mtime + file size).
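Such a cache might look roughly like this (entirely hypothetical; nothing like it existed in Nix at the time):

```cpp
#include <sys/stat.h>
#include <map>
#include <string>
#include <tuple>

// Hypothetical store-path cache keyed on (inode, size, mtime), in the
// spirit of the builtins.fetchGit cache: if the file looks unchanged,
// reuse the previously computed store path and skip re-reading it.
using FileKey = std::tuple<ino_t, off_t, time_t>;
static std::map<FileKey, std::string> storePathCache;

const std::string * lookupCachedStorePath(const std::string & path)
{
    struct stat st;
    if (lstat(path.c_str(), &st) != 0) return nullptr;
    auto it = storePathCache.find({st.st_ino, st.st_size, st.st_mtime});
    return it == storePathCache.end() ? nullptr : &it->second;
}
```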

@shlevy (Member) commented Mar 26, 2018

Right, I'm saying there's a reasonable use case space for taking the hit of a long read on every eval to avoid manually specifying the hash.

I was actually going to separately suggest an mtime-based cache as well, so definitely let's do that 😀

@edolstra (Member) replied:
Well, we don't have to support every reasonable use case if the cost in terms of complexity is too high. The hybrid approach is pretty complex, especially if you take RemoteStore into account.

@shlevy (Member) commented Mar 26, 2018

@edolstra I guess the problem with the fetchurl approach is filters etc. Maybe what we want is a new option on builtins.path, so you have to opt-in to the stream-to-temp-file approach, and we can add caching to that or not.

I do agree we don't need to support all use cases, of course, but I do think this is an (unfortunately) common one when using Nix for development. These repos grow huge, and it's not always trivial to filter them (and sometimes the size is in actual source code...).

@edolstra (Member) commented:
@volth Yeah, copy closure is largely fixed.

shlevy added the backlog label and self-assigned this on Apr 1, 2018.
@Ericson2314 (Member) commented Apr 8, 2018

> If you have a giant dependency, you're better off using fetchurl or something similar.

But how does one get the hash in the first place to write the fetch*? Maybe the new Nix 2.0 handling of fixed-output hash mismatch failures helps?

> Copy to a temporary location in the store, then rename...

Renaming is really cheap on any half-decent filesystem. Let's just do that. In fact, don't we already take that approach to support the new behavior of avoiding a second fetch on hash mismatch, which I mentioned above? If not, FWIW the intensional store also has to do exactly the same thing, so I'd hope we'd end up writing this anyway.


Taking a step back, it's really important that we fix this. People new to Nix regularly hit this as they incrementally nixify some preexisting build process, as that usually means larger files than usual. When I evangelize Nix to someone, I need to offer up tons of disclaimers for things like this so that when they inevitably hit this, Nix and I don't instantly lose all credibility. A lot of issues like this one wouldn't require adding too much complexity to fix, but fixing them would greatly improve Nix's image and the experience for Nix users.

@edolstra (Member) commented Apr 9, 2018

> Renaming is really cheap on any half-decent filesystem.

The problem is not renaming, it's writing it to the store in the first place. This causes a huge amount of unnecessary I/O in the common case where the target path already exists. (Also you can run out of disk space.)

@edolstra (Member) commented Apr 9, 2018

Okay, I took a stab at implementing hybrid option 3: edolstra@c94b4fc

@Ericson2314 (Member) commented Apr 12, 2018

@edolstra:

> This causes a huge amount of unnecessary I/O in the common case where the target path already exists. (Also you can run out of disk space.)

First of all, I'm very glad you are attempting a solution here! If we are concerned about memory usage and disk usage, and secondarily complexity, @kmicklas gave me this idea (roughly sketched after the comparison below):

  1. hash . toNar everything in place
  2. If we don't have the path, "cp -r" everything to a temp location
  3. hash . toNar the temp location
  4. Optional: warn if the hash changed; the store path is not corrupt either way, so it doesn't matter too much
  5. Move to the proper location

Compared to the existing implementation, this always uses <= disk space and far smaller, O(1) memory, at the cost of 2x read I/O in the less common case where the path is indeed new. For the small-file common case that should be a negligible cost.

Compared to your edolstra@c94b4fc, besides the 2x reads, it avoids the complexity / discontinuity of an arbitrary cutoff, and avoids a fromNar . toNar round trip, giving the underlying filesystem a chance to optimize the cp -r (e.g. --reflink COW stuff).

What do you think?
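Rendered as code, the five steps might look like this (hypothetical helper names throughout; this shows the shape of the algorithm, not real Nix code):

```cpp
#include <cstdio>   // rename()

// The five steps above: hash in place first, copy only if the path is
// actually new, re-hash the copy, warn on a mid-copy change, then rename.
Path addToStoreTwoPass(const Path & src, const string & name)
{
    Hash h = hashNarInPlace(src);               // 1. hash . toNar in place
    Path dst = makeFixedOutputPath(true, h, name);
    if (isValidPath(dst)) return dst;           //    common case: one read, O(1) memory
    Path tmp = copyTreeToTempInStore(src);      // 2. "cp -r" to a temp location
    Hash h2 = hashNarInPlace(tmp);              // 3. hash . toNar the copy
    if (h2 != h)                                // 4. optional: source changed mid-copy?
        warnPathChanged(src);                   //    the copy is consistent either way
    dst = makeFixedOutputPath(true, h2, name);  //    store what we actually copied
    rename(tmp.c_str(), dst.c_str());           // 5. move to the proper location
    return dst;
}
```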

@shlevy (Member) commented Apr 12, 2018

@Ericson2314 I like that approach, though ultimately I think the tradeoff should be decided with profiling. In any case, for small files we needn't take the 2x I/O hit, as we can just continue reading it all into memory.

@shlevy (Member) added:

And we should possibly consider a cache (based on inode+size+mtime or whatever), and make it easy for a hash to be provided up-front to avoid this concern altogether.

stale bot commented Feb 12, 2021: I marked this as stale due to inactivity.

stale bot added the stale label Feb 12, 2021.
stale bot commented Apr 16, 2022: I closed this issue due to inactivity.

stale bot closed this Apr 16, 2022.