Fix adding paths to the store (try 3) #1754
Conversation
src/libstore/local-store.cc (outdated)

```cpp
copyPath(srcPath, dstPath, filter, recursive);

canonicalisePathMetaData(dstPath, -1);
HashResult hash = hashPath(htSHA256, dstPath);
```
So we copy then hash? How does this differ from the `hashPath` call above?
I suspect this `hashPath` is redundant. Assuming that `canonicalisePathMetaData` doesn't affect the hash, it seems like we should be able to use the hash `h` computed above.
@Mathnerd314, do you recall why this second hash was needed?
See the comment over here. The first hash computes the output path, the second is the SHA256 of the NAR serialization which is used for validity checking. I guess the code could re-use the hash in the case of recursive SHA256, but hashing isn't that expensive.
```diff
@@ -412,9 +412,16 @@ Path RemoteStore::addToStore(const string & name, const Path & _srcPath,

     auto conn(connections->get());

     if (GET_PROTOCOL_MINOR(conn->daemonVersion) < 0x15) {
```
That's a pretty tight compatibility requirement :(
Right, I honestly don't know whether we could improve the compatibility story here. Ideas are certainly welcome.
Recent commits (based on an earlier version of this, I think?) should alleviate much of this. Can this be closed, or is it still relevant?
I will happily find time to look into the failures; however, I would first like some feedback on the acceptability of the general approach.
@bgamari In a general sense I'm very much in favor of this, but can you provide a high-level view of what this is doing on top of the recent streaming improvements already on master?
@shlevy Those improve copying paths between stores (including substitution); they don't improve […].

If I read this PR correctly, it reads the path twice: first to compute the hash / store path, second to copy it to the destination store path. This has a race: if the contents of the source path change in between, we end up with a corrupt store path. (Note that this is a security bug.)

In general, we have a tension here between performance and memory usage. There are several alternatives: […]
However, maybe this should be a "will not fix", since Nix is not really intended for use cases where you have enormous path dependencies: it will always be slow. (Even in the best case, it needs to read/hash the entire path on every evaluation.) If you have a giant dependency, you're better off using […].
Hmm, IMO a hybrid approach is a good idea, especially since we already have the "large path" warning, so we already have a sense of "small". The problem with fetchurl or similar is the need to know the hash in advance, and it doesn't have the nice relative path resolution; without that, I think that'd be a good solution here.
Well, you have to specify the hash to prevent having to read the file on every evaluation. Though I suppose […]
Right, I'm saying there's a reasonable use case space for taking the hit of a long read on every eval to avoid manually specifying the hash. Was actually going to separately suggest an mtime-based cache as well, so definitely let's do that 😀
Well, we don't have to support every reasonable use case if the cost in terms of complexity is too high. The hybrid approach is pretty complex, especially if you take `RemoteStore` into account.
@edolstra I guess the problem with the fetchurl approach is filters etc. Maybe what we want is a new option on […].

I do agree we don't need to support all use cases, of course, but I do think this is an (unfortunately) common one when using Nix for development. These repos grow huge, and it's not always trivial to filter them (and sometimes the size is in actual source code...).
@volth Yeah, copy closure is largely fixed.
But how does one get the hash in the first place to write the […]
Renaming is really cheap on any half-decent file system, so let's just do that. In fact, don't we already take that approach to support the new behavior of avoiding the second fetch on hash mismatch, which I mentioned above? If not, FWIW the intensional store also has to do exactly the same thing, so I'd hope we'd end up writing this anyway.

Taking a step back, it's really important that we fix this. People new to Nix regularly hit this as they incrementally nixify some preexisting build process, since that usually means larger files than usual. When I evangelize Nix to someone, I need to offer up tons of disclaimers for things like this, so that when they inevitably hit them, Nix and I don't instantly lose all credibility. A lot of issues like this one wouldn't require adding too much complexity to fix, but fixing them would greatly improve Nix's image and the experience for Nix users.
The problem is not renaming, it's writing it to the store in the first place. This causes a huge amount of unnecessary I/O in the common case where the target path already exists. (Also, you can run out of disk space.)
Okay, I took a stab at implementing hybrid option 3: edolstra@c94b4fc
First of all, very glad you are attempting a solution here! If we are concerned about memory usage and disk usage, and secondarily complexity, @kmicklas gave me the idea for this: […]
Compared to the existing implementation, this always uses <= disk space and much smaller, O(1), memory, at the cost of 2x read I/O in the less common case where the path is indeed new. For the small-file common case that should be a negligible cost. Compared to your edolstra@c94b4fc, besides the 2x reads, it avoids the complexity / discontinuity of an arbitrary cutoff, and avoids a […]. What do you think?
@Ericson2314 I like that approach, though ultimately I think the tradeoff should be decided with profiling. In any case, for small files we needn't take the 2x I/O hit, as we can just continue reading it all into memory.
And we should possibly consider a cache (based on inode+size+mtime or whatever), and make it easy for a hash to be provided up-front to avoid this concern altogether.
I marked this as stale due to inactivity.
I closed this issue due to inactivity.
This is a rebased version of the patch originally authored by @Mathnerd314 and proposed for merge in #619. It refactors the `addToStore` logic to stream data from the file system, avoiding the need to read potentially large files into memory. Note that this patch, unlike the version initially put forth in #619, incurs no dependency on Boost.

So far this is only build-tested. However, I would like review on the overall idea.