Build failures due to 'hash mismatch importing path' in hydra-queue-runner #816
For reference, I ran the following command on the Hydra server and it succeeded:
However, the build is still failing because another build step has the same problem.
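The contents of a store path can be re-checked against its registered hash with `nix-store --verify-path`; the store path below is a placeholder:

```
# Reports an error and exits non-zero if the on-disk contents of the
# path no longer match the hash registered in the Nix database.
nix-store --verify-path /nix/store/xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx-example-1.0
```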
This is happening much more often now.
As far as I can tell, there is a hash mismatch but there is also a size mismatch:
However, I'm not sure yet why there are persistent size mismatches; in fact, I find the code a bit hard to understand due to the many layers of abstraction...
I can support this observation. I also battle with this kind of error. Normally, the path gets realized by some other means, so that restarting the failed build makes progress.

I have a parallel observation to contribute, which I suspect is related: the queue runner sometimes has trouble finishing off a job, hanging in "Receiving outputs" forever. The ssh connection on the build host is long gone by that time, but the ssh process forked by the queue runner still runs on the server and does nothing. I have also observed this behavior when running a build on localhost (which does not involve `ssh`).

EDIT: The build hosts run various versions of nix (with localhost obviously running hydra's version of nix), so this is likely not related to a behavioral change in the protocol.
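Lingering ssh children of the queue runner can be spotted on the server with standard process tools, for example (assuming the default NixOS service name and that `pgrep`/`pstree` are available):

```
# On the Hydra server: shows the unit's process tree, including any
# ssh children that are still hanging around.
systemctl status hydra-queue-runner
# or directly:
pstree -ap "$(pgrep -f hydra-queue-runner | head -n1)"
```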
I have also observed a weird issue with hydra-queue-runner's ssh connections, but slightly different from yours. Sometimes a build step gets stuck running for hours on one of the build machines without any progress, according to the Hydra web interface (this never happens for localhost, only on the build machines). When I inspect the processes on the Hydra server and the build machines, this is what I find: on the Hydra server, hydra-queue-runner has an ssh child process with a valid TCP connection to the build machine. However, the corresponding sshd process on the build machine has no child process! It seems like the sshd server should have terminated the connection when its child process died, but for some reason it didn't.

Note that on both my Hydra server and my build machines, I have the following on my
... and the following on my
The behavior I have just described happens to me a lot, and I suspect it has something to do with this bug, but I have no idea how it could be related...
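The childless sshd can be confirmed on the build machine with something like this (run as root; the PID in the second command is a placeholder taken from the first command's output):

```
# On the build machine:
ss -tnp | grep -w sshd     # established connections and the owning sshd PIDs
pstree -p 12345            # replace 12345 with an sshd PID from above;
                           # a stuck session shows up as an sshd with no children
```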
This specific issue has been solved for me by upgrading my Hydra server to NixOS 20.09, even though my build machines weren't upgraded yet. However, I've observed another issue, similar to what @spacefrogg has said above (but I didn't diagnose anything related to ssh processes yet). I'm closing this issue for now, but feel free to reopen if you still observe this behavior in NixOS 20.09.
I'm still experiencing this issue on NixOS 20.09.
One possibility is that a network connectivity issue is causing corruption or packet dropping between the hosts involved. You might try repeated transfers with `nix copy`, or of similarly sized files, and verify the SHA-256 checksum after each transfer; even if only 1 out of 100 transfers is corrupted, that would be significant, since Hydra regularly transfers store contents between the workers and the Hydra machine.
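A minimal version of that test could look like this (the builder host name and test file are placeholders; it copies the same file repeatedly and compares checksums on both ends):

```
# Create a reasonably large test file, then transfer it repeatedly and
# flag any checksum mismatch.
dd if=/dev/urandom of=/tmp/testfile bs=1M count=256
for i in $(seq 1 100); do
  scp -q /tmp/testfile builder.example.org:/tmp/testfile
  local_sum=$(sha256sum < /tmp/testfile)
  remote_sum=$(ssh builder.example.org 'sha256sum < /tmp/testfile')
  [ "$local_sum" = "$remote_sum" ] || echo "mismatch on transfer $i"
done
```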
@kquick: Thanks for the suggestion! It is true that in my case the communication between the two hosts runs over an experimental network, but I've been using this network for many months, making thousands of connections and transferring hundreds of GBs of data (over scp and rsync, which use ssh, and over NFS), and I haven't observed any other problems so far (no dropped connections, etc.). That said, I can see how it could be part of the problem. The transfers shouldn't be getting corrupted (because ssh guarantees integrity), but it's possible that some transfers are getting truncated, which could explain the size mismatches I saw above.

However, the bigger problem is that the failures become persistent. Once a derivation fails to get copied due to a hash mismatch, it always fails with a hash mismatch (always with the same erroneous hash), even if I restart the build several times... Manually doing

Also, it is interesting that the hash mismatches happen when building on both remote builders, even producing the same erroneous hashes for the same derivations when copying from different remote hosts!
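For reference, a single store path can be copied from a builder by hand with either of these (the host name and store path are placeholders):

```
# Newer nix CLI:
nix copy --from ssh://builder.example.org /nix/store/xxxxxxxx-example-1.0
# Or the older interface:
nix-copy-closure --from builder.example.org /nix/store/xxxxxxxx-example-1.0
```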
Ok, I've been doing some debugging and it doesn't seem like this is a network problem. As far as I can see there is no corruption or truncation happening.
Backtrace:
Expected size:
Received size:
So the Nix daemon actually received a NAR file larger than it was expecting.
As you can see, it exactly matches the actual received hash and size on the Hydra server (but not the ones that were expected), even though the Hydra server was importing a different path!
So as far as I can tell, there was no corruption at all, it's just that
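For comparison, the NAR hash and size that a store has registered for a (valid) path can be queried like this (the path is a placeholder):

```
nix-store --query --hash /nix/store/xxxxxxxx-example-1.0   # registered NAR hash
nix-store --query --size /nix/store/xxxxxxxx-example-1.0   # registered NAR size
nix path-info --json /nix/store/xxxxxxxx-example-1.0       # both, as JSON, on newer nix
```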
It turns out the previous two stack traces were due to an unrelated issue, so I'm back to where I was, but I'm now running the latest commits of
Ok, sorry for all the comments, but I've finally figured out where the bug is and it all makes sense now. This is definitely a bug in
What's happening is that
Now let's say that the network connection between the Hydra server and the remote build server breaks while importing the
At some point, the network connection is reestablished and
But now the following events happen:
The actual code is a lot more convoluted than what the above events suggest, but one possible solution is simply for
I will submit a PR soon.
This would happen if the network connection between the Hydra server and the remote build server breaks after successfully importing at least one output of a derivation, but before having finished importing all outputs. Fixes NixOS#816.
This would start happening if the network connection between the Hydra server and the remote build server breaks after successfully importing at least one output of a derivation, but before having finished importing all outputs. Fixes NixOS#816.
I'm running into a persistent issue with Hydra and I was hoping someone could help me debug it.
I'm getting the following `hydra-queue-runner` error on one of the build steps:

As far as I understand, this seems to happen while importing the derivation from the remote builder.
The error seems to be persistent, which means that the builds end up getting aborted after some number of retries.
Previously, this was happening on only one of the remote builders, when building a different derivation (the nix package itself).
However, now the same build step (this perl package) seems to fail with both remote builders, down to the same hash values.
These are all the `hydra-queue-runner` log messages printed when the build step fails.
This is my `/etc/nix/machines` file on my Hydra server, for reference:

Any ideas on how I can debug this problem?
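For context, `/etc/nix/machines` entries follow the usual format of one builder per line: URI, system type, SSH identity file, max jobs, speed factor, supported features, mandatory features. For example (host names, key paths, and features below are placeholders, not the actual file from this report):

```
ssh://builder1.example.org x86_64-linux /root/.ssh/id_builder 4 1 kvm,nixos-test -
ssh://builder2.example.org x86_64-linux /root/.ssh/id_builder 4 1 kvm,nixos-test -
```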