hydra-queue-runner gets stuck while there are items in the queue #366
Comments
Ah, it got interrupted after a day:
I think this doesn't happen often on hydra.nixos.org since new builds come all the time, but something is keeping Hydra queue runner from finishing the queue. |
Note: I'm running commit dc790c5 |
Happened again:
|
@edolstra any ideas what's wrong here? |
Again:
|
Note: I'm using |
I also have a habit of restarting the queue runner. There's some lockup almost every other time I submit a new evaluation. |
Also running into this (I have dc5e0b1), and kicking hydra-queue-runner works. This seems like a critical bug for general hydra usability, and will definitely turn off non-experts. I'm at least documenting this on the wiki since it seems like everyone with a private hydra instance deals with this. |
This also breaks declarative jobsets I've noticed. If hydra-queue-runner gets into its stuck state, even if a .jobsets has been evaluated and even built recently, the list of jobsets will not update without restarting hydra-queue-runner. I now have a cron job that checks hydra-queue-runner every minute and kicks it if there are >0 jobs queued and 0 jobs running, or if there are 0 jobs queued and 0 jobs running (to handle the declarative jobsets case). Seems to work well, but tricky to implement since you have to query both the Hydra JSON API to get the number of jobs running, and scrape an HTML page to get the number of jobs queued. You don't want to restart hydra-queue-runner if there are running jobs that take a long time because those will get canceled and have to start from the beginning. |
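The restart rule described in the comment above can be sketched as a small decision function. This is only an illustration, assuming the two counts have already been fetched from the JSON API and the scraped HTML page; the names `QueueSnapshot` and `shouldKickQueueRunner` are hypothetical and are not part of Hydra.

```cpp
// Hypothetical snapshot of the two numbers the cron job collects.
struct QueueSnapshot {
    int queued;   // builds waiting in the queue (scraped from the HTML page)
    int running;  // builds currently executing (from the JSON API)
};

// Returns true when restarting hydra-queue-runner looks both safe and useful.
bool shouldKickQueueRunner(const QueueSnapshot& s) {
    // Never restart while builds are running: they would be cancelled
    // and have to start again from the beginning.
    if (s.running > 0) return false;

    // Stuck queue: work is waiting but nothing is being built.
    bool stuckQueue = s.queued > 0;
    // Idle runner: restart anyway so declarative jobsets get refreshed.
    bool idle = s.queued == 0;
    return stuckQueue || idle;
}
```

Written this way, the two cases from the comment reduce to "restart whenever nothing is running", which is why the count of running jobs is the number that matters most.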
This just happened to me on a new hydra deploy. The first three builds worked fine. I canceled the fourth build partway through. Now the fifth build is queued and won’t run. |
Ok, in my case the problem seems to result from interrupting a build before it finished. I interrupted a job during the build of cuda_10.1 and any time I try to restart, that seems to be related to the hang:
Also, let me know if I should make a new issue, as I'm no longer sure if this is the same as @domenkozar's original post. Edit: just realized that the |
Not sure if related, but I’ve noticed large derivations (several GiB) take a long time to copy, either to the local store at the conclusion of a build (like a fetch) or to a remote store. My rough guess is that there may be some quadratic performance characteristic. If that is the case, it usually sits for a while with no output and a single nix thread spinning at 100%. Perhaps this happened and a 15 minute timeout was reached? |
@tomberek interesting, thanks! That helped me fix it. It looks like the
19 minutes and 16GB of RAM is crazy for copying a 2.4GB file, but hey it works now. Is this performance problem a nix or hydra problem? Ironically, it's faster and more memory efficient to grab the .run file from remote than store!! From remote, I could build on a box with only 4GB of RAM. Edit: my point about time may be unfair. It's possible this was slow due to EBS throttling by AWS. Although I think the memory question is still valid, and it'd be good to know where I should file the issue |
This is a nix problem. But I've been unable to discover more details. |
Upon more investigation, it seems this may be due to the compression. This is normally hidden by other costs and parallelized by having smaller derivations/folders/files. Large files make this more apparent. Using |
We had frequent problems with hydra-queue-runner not processing the build queue. I've seen at least one case where there was work pending but the notification did not trigger. I know this from logging some extra info. So maybe it makes sense to add a configuration option that allows the user to set the maximum wait time? That seems better than restarting the queue runner in a cron job.

diff --git a/src/hydra-queue-runner/queue-monitor.cc b/src/hydra-queue-runner/queue-monitor.cc
--- a/src/hydra-queue-runner/queue-monitor.cc
+++ b/src/hydra-queue-runner/queue-monitor.cc
@@ -42,27 +42,37 @@ void State::queueMonitorLoop()
         /* Sleep until we get notification from the database about an
            event. */
         if (done) {
-            conn->await_notification();
+            conn->await_notification(5*60, 0);
             nrQueueWakeups++;
         } else

Update: No further cases (Jun 26th - Aug 6th) of queue-runner getting stuck, so this has solved the problem for us. |
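The idea behind the patch above is a bounded wait: if a wake-up is ever missed, the loop still re-checks the queue after at most the timeout. Below is a minimal sketch of that pattern using `std::condition_variable` in place of the database connection. The `Notifier` type is a hypothetical stand-in for illustration only and is not Hydra or libpqxx code.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical stand-in for the database notification channel.
struct Notifier {
    std::mutex m;
    std::condition_variable cv;
    bool pending = false;

    // Called by the producer when new work arrives.
    void notify() {
        {
            std::lock_guard<std::mutex> lock(m);
            pending = true;
        }
        cv.notify_one();
    }

    // Waits for a notification, but never longer than maxWait.
    // Returns true if a notification arrived, false on timeout.
    bool awaitNotification(std::chrono::milliseconds maxWait) {
        std::unique_lock<std::mutex> lock(m);
        bool got = cv.wait_for(lock, maxWait, [this] { return pending; });
        pending = false;  // consume the notification, if any
        return got;
    }
};
```

A monitor loop built on this would re-scan the queue whenever `awaitNotification` returns, whether it reports a notification or a timeout, so a lost notification delays work by at most `maxWait` instead of indefinitely.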
The nixos.wiki points to this commit as a possible fix. It is part of #597, and some users report that it works. I would love to see this fixed, so would this commit be a generally accepted way to fix this issue? |
A build step is performed but never stops. On the hydra host machine there is a process `nix-store --builders --serve --write` that waits indefinitely. `strace` shows the process is reading from a pipe but never gets any data. On the pipe's write end there is another process that polls for something (did not look into this) and reads from a socket; `lsof` then shows that the socket's read and write ends both belong to the second process. After reverting to nix 2.10.3 the hanging problem is gone.

Possibly related:
- <NixOS/hydra#366>
- <NixOS/nix#2560>
- <NixOS/nix#2260>

Revert "chore: bump flake inputs"
This reverts commit 1d75b21.
Signed-off-by: Gaoyang Zhang <gy@blurgy.xyz>
I applied the patch in #366 (comment) five months ago, and since then I haven't needed to restart Hydra anywhere near as often. |
Happy to provide more information while the server is still stuck, but I'll have to restart it today/tomorrow.
It seems to me it's waiting on postgresql?
cc @edolstra