hydra-queue-runner gets stuck while there are items in the queue #366

Open
domenkozar opened this issue Jul 25, 2016 · 19 comments

@domenkozar
Member

(gdb) bt
#0  0x00007f31a596940d in poll () from /nix/store/gwl3ppqj4i730nhd4f50ncl5jc4n97ks-glibc-2.23/lib/libc.so.6
#1  0x00007f31a64db2b7 in (anonymous namespace)::wait_fd(int, bool, timeval*) () from /nix/store/zclq81axk9yg3rv4xlimq7vsqik5a2aw-libpqxx-4.0.1/lib/libpqxx-4.0.so
#2  0x00007f31a64dc4c1 in pqxx::internal::wait_read(pg_conn const*, long, long) () from /nix/store/zclq81axk9yg3rv4xlimq7vsqik5a2aw-libpqxx-4.0.1/lib/libpqxx-4.0.so
#3  0x00007f31a64e0246 in pqxx::connection_base::await_notification(long, long) () from /nix/store/zclq81axk9yg3rv4xlimq7vsqik5a2aw-libpqxx-4.0.1/lib/libpqxx-4.0.so
#4  0x00000000004196c4 in State::run(unsigned int) ()
#5  0x000000000041a272 in main::{lambda()#1}::operator()() const ()
#6  0x00007f31a7771802 in nix::handleExceptions(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::function<void ()>) ()
   from /nix/store/r3blahkd7z58xsdbgdyl5hbfgaqn0var-nix-1.12pre4523_3b81b26/lib/libnixmain.so
#7  0x000000000040eada in main ()

Happy to provide more information while the server is still stuck, but I'll have to restart it today or tomorrow.

It seems to me it's waiting on PostgreSQL?

cc @edolstra

@domenkozar
Member Author

Ah, it got interrupted after a day:

Jul 25 11:36:53 eiger hydra-queue-runner[30255]: got notification: new builds added to the queue
Jul 25 11:36:53 eiger hydra-queue-runner[30255]: checking the queue for builds > 168416...

I think this doesn't happen often on hydra.nixos.org since new builds arrive all the time, but something is keeping the Hydra queue runner from working through the rest of the queue.

@domenkozar domenkozar added the bug label Jul 25, 2016
@domenkozar
Member Author

Note: I'm running commit dc790c5

@domenkozar
Member Author

Happened again:

#0  0x00007f232e88840d in poll () from /nix/store/gwl3ppqj4i730nhd4f50ncl5jc4n97ks-glibc-2.23/lib/libc.so.6
#1  0x00007f232f3fa2b7 in (anonymous namespace)::wait_fd(int, bool, timeval*) ()
   from /nix/store/zclq81axk9yg3rv4xlimq7vsqik5a2aw-libpqxx-4.0.1/lib/libpqxx-4.0.so
#2  0x00007f232f3fb4c1 in pqxx::internal::wait_read(pg_conn const*, long, long) ()
   from /nix/store/zclq81axk9yg3rv4xlimq7vsqik5a2aw-libpqxx-4.0.1/lib/libpqxx-4.0.so
#3  0x00007f232f3ff246 in pqxx::connection_base::await_notification(long, long) ()
   from /nix/store/zclq81axk9yg3rv4xlimq7vsqik5a2aw-libpqxx-4.0.1/lib/libpqxx-4.0.so
#4  0x00000000004196c4 in State::run(unsigned int) ()
#5  0x000000000041a272 in main::{lambda()#1}::operator()() const ()
#6  0x00007f2330690802 in nix::handleExceptions(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::function<void ()>) () from /nix/store/r3blahkd7z58xsdbgdyl5hbfgaqn0var-nix-1.12pre4523_3b81b26/lib/libnixmain.so
#7  0x000000000040eada in main ()

@domenkozar
Member Author

@edolstra any ideas what's wrong here?

@domenkozar
Member Author

Again:

(gdb) bt
#0  0x00007f43ebccb40d in poll ()
   from target:/nix/store/gwl3ppqj4i730nhd4f50ncl5jc4n97ks-glibc-2.23/lib/libc.so.6
#1  0x00007f43ec83d2b7 in (anonymous namespace)::wait_fd(int, bool, timeval*) ()
   from target:/nix/store/g5h8ip9279jb0y4h3w35rk4qpj7rxq4d-libpqxx-4.0.1/lib/libpqxx-4.0.so
#2  0x00007f43ec83e4c1 in pqxx::internal::wait_read(pg_conn const*, long, long) ()
   from target:/nix/store/g5h8ip9279jb0y4h3w35rk4qpj7rxq4d-libpqxx-4.0.1/lib/libpqxx-4.0.so
#3  0x00007f43ec842246 in pqxx::connection_base::await_notification(long, long) ()
   from target:/nix/store/g5h8ip9279jb0y4h3w35rk4qpj7rxq4d-libpqxx-4.0.1/lib/libpqxx-4.0.so
#4  0x00000000004196c4 in State::run(unsigned int) ()
#5  0x000000000041a272 in main::{lambda()#1}::operator()() const ()
#6  0x00007f43edad3802 in nix::handleExceptions(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::function<void ()>) ()
   from target:/nix/store/7isk02splr10h3a3q9a2nq7grfqymynv-nix-1.12pre4523_3b81b26/lib/libnixmain.so
#7  0x000000000040eada in main ()

@domenkozar
Member Author

Note: I'm using systemctl restart hydra-queue-runner as a workaround to get the queue processing again.

@mbrock

mbrock commented Mar 9, 2018

I also have a habit of restarting the queue runner. There's some lockup almost every other time I submit a new evaluation.

@LisannaAtHome

Also running into this (I have dc5e0b1), and kicking hydra-queue-runner works.

This seems like a critical bug for general hydra usability, and will definitely turn off non-experts. I'm at least documenting this on the wiki since it seems like everyone with a private hydra instance deals with this.

@LisannaAtHome

LisannaAtHome commented Jul 20, 2018

I've noticed this also breaks declarative jobsets. If hydra-queue-runner gets into its stuck state, the list of jobsets will not update without restarting hydra-queue-runner, even if the .jobsets jobset has been evaluated and built recently.

I now have a cron job that checks hydra-queue-runner every minute and restarts it if there are more than zero jobs queued and zero jobs running, or if there are zero jobs queued and zero jobs running (to handle the declarative jobsets case). It works well, but it's tricky to implement because you have to query the Hydra JSON API to get the number of running jobs and scrape an HTML page to get the number of queued jobs. You don't want to restart hydra-queue-runner while long-running jobs are in flight, because they would be canceled and have to start over from the beginning. A rough sketch of the watchdog is below.
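
A minimal Python sketch of such a watchdog, for reference. The base URL, endpoint paths, HTML regex, and script path are placeholders I made up, not Hydra's documented API, so adapt them to whatever your instance actually exposes; note that the two conditions above collapse into "restart whenever nothing is running".

#!/usr/bin/env python3
# Watchdog sketch for the cron-based workaround described above.
# NOTE: the base URL, endpoint paths and HTML regex below are hypothetical
# placeholders, not Hydra's documented API; adjust them before use.
import json
import re
import subprocess
import urllib.request

HYDRA_URL = "http://localhost:3000"   # placeholder: local Hydra web interface

def fetch(path, accept):
    req = urllib.request.Request(HYDRA_URL + path, headers={"Accept": accept})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.read().decode()

def jobs_running():
    # Placeholder JSON endpoint listing builds currently in progress.
    return len(json.loads(fetch("/api/running-builds", "application/json")))

def jobs_queued():
    # Placeholder HTML page scraped for the queued-build count.
    html = fetch("/queue-summary", "text/html")
    m = re.search(r"(\d+)\s+builds?\s+in\s+the\s+queue", html)
    return int(m.group(1)) if m else 0

def main():
    running = jobs_running()
    queued = jobs_queued()
    # Restart only when nothing is running, so in-flight builds are never
    # canceled. (queued > 0 catches a stalled queue; queued == 0 catches the
    # idle declarative-jobsets case, so together they mean: running == 0.)
    if running == 0:
        print(f"queued={queued} running=0, restarting hydra-queue-runner")
        subprocess.run(["systemctl", "restart", "hydra-queue-runner"], check=True)

if __name__ == "__main__":
    main()

Run it from root's crontab every minute, e.g. * * * * * /usr/local/bin/hydra-queue-watchdog.py (the path is just an example).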

@LisannaAtHome

Duplicate issue #550

Applying 73ca325 seems to have fixed this issue for me

@tbenst

tbenst commented Dec 23, 2019

This just happened to me on a new hydra deploy. The first three builds worked fine. I canceled the fourth build partway through. Now the fifth build is queued and won't run. systemctl restart hydra-queue-runner did not fix it, unfortunately.

@tbenst

tbenst commented Dec 23, 2019

OK, in my case the problem seems to result from interrupting a build before it finished. I interrupted a job during the build of cuda_10.1, and any time I restart, that step seems to be what causes the hang:

> journalctl -u hydra-queue-runner.service
Dec 23 18:11:47 hydra-server hydra-queue-runner[15353]: queue monitor: std::bad_alloc
Dec 23 18:11:57 hydra-server hydra-queue-runner[15353]: checking the queue for builds > 0...
Dec 23 18:11:57 hydra-server hydra-queue-runner[15353]: loading build 5 (tyler:babelfish:babelfish)
Dec 23 18:11:57 hydra-server hydra-queue-runner[15353]: copying output ‘/nix/store/ip6149gj1ip9l8yzb2lizyvsr6f78r0a-cuda_10.1.243_418.87.00_linux.run’ of ‘/nix/store/zgzivax9qf558rw4wkrcmjfbd09gpmpc-cuda_10.1.243_418.87.00_linux.run.drv’ from local store
Dec 23 18:11:57 hydra-server hydra-queue-runner[15353]: copying 1 paths...
Dec 23 18:11:57 hydra-server hydra-queue-runner[15353]: copying path '/nix/store/ip6149gj1ip9l8yzb2lizyvsr6f78r0a-cuda_10.1.243_418.87.00_linux.run' to 'file:///var/lib/hydra/cache'...

ip6149gj1ip9l8yzb2lizyvsr6f78r0a-cuda_10.1.243_418.87.00_linux.run is never copied to /var/lib/hydra/cache/*. I tried nix-collect-garbage, but the .run file remains, and I cannot rm the file because of the read-only filesystem. Any ideas how to fix this broken state? I'm a bit surprised that I could bork the system by cancelling a build.

Also, let me know if I should make a new issue, as I'm no longer sure if this is the same as @domenkozar's original post.

Edit: I just realized that the .run file is the URL for src = fetchurl ..., so it makes sense that it's in the Nix store. It's a 2.4 GB file, but I'm not sure why the copying path command never finishes.

@tomberek
Contributor

tomberek commented Jan 2, 2020

Not sure if it's related, but I've noticed large derivations (several GiB) take a long time to copy, either to the local store at the conclusion of a build (like a fetch) or to a remote store. My rough guess is that there is some quadratic performance characteristic.

If that is the case, it usually sits for a while with no output and a single nix thread spinning at 100%. Perhaps this happened and a 15-minute timeout was reached?

@tbenst

tbenst commented Jan 2, 2020

@tomberek interesting, thanks! That helped me fix it. It looks like the queue monitor: std::bad_alloc consistently happens 1-3 minutes after copying path [...].run begins. I bumped up to a t2.large (8 GB RAM) and got the bad_alloc error after 17 minutes. Now, bumped to a t2.xlarge (16 GB RAM), the copy command finally finished:

Jan 02 20:33:22 hydra-server hydra-queue-runner[1120]: copying path '/nix/store/ip6149gj1ip9l8yzb2lizyvsr6f78r0a-cuda_10.1.243_418.87.00_linux.run' (2572375416 bytes, compressed 1.4% in 1168394 ms) to binary cache

19 minutes and 16 GB of RAM is crazy for copying a 2.4 GB file, but hey, it works now. Is this performance problem a Nix or a Hydra problem?

Ironically, it's faster and more memory-efficient to grab the .run file from the remote URL than from the store! From the remote, I could build on a box with only 4 GB of RAM.

Edit: my point about time may be unfair; it's possible this was slow due to EBS throttling by AWS. I think the memory question is still valid, though, and it'd be good to know where I should file the issue.

@tomberek
Contributor

tomberek commented Jan 4, 2020

This is a Nix problem, but I've been unable to discover more details.

@tomberek
Contributor

tomberek commented Jan 4, 2020

Upon more investigation, it seems this may be due to the compression. This is normally hidden by other costs and parallelized by having smaller derivations/folders/files. Large files make this more apparent. Using s3://bucket?parallel-compression=true can alleviate the situation. I'm looking hopefully at this: NixOS/nix#2276
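
For reference, a hydra.conf sketch along those lines; the bucket name is a placeholder, and it assumes your binary cache is already an S3 store:

# hydra.conf sketch (hypothetical bucket name): upload to an S3 binary cache
# with multi-threaded compression enabled.
store_uri = s3://my-hydra-cache?parallel-compression=true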

@xbreak

xbreak commented Jun 26, 2020

We had frequent problems with hydra-queue-runner not processing the build queue.
But a while ago I added a 5-minute maximum wait to await_notification and have not experienced the issue since *knocks on wood*.

I've seen at least one case where there was work pending but the notification did not trigger; I know this from logging some extra info.
Yes, OK, it is possible, but unlikely, that the notification was scheduled to fire an instant after await_notification timed out.

So maybe it makes sense to add a configuration option that lets the user set the maximum wait time? That seems better than restarting the queue runner from a cron job.

diff --git a/src/hydra-queue-runner/queue-monitor.cc b/src/hydra-queue-runner/queue-monitor.cc
--- a/src/hydra-queue-runner/queue-monitor.cc
+++ b/src/hydra-queue-runner/queue-monitor.cc
@@ -42,27 +42,37 @@ void State::queueMonitorLoop()

         /* Sleep until we get notification from the database about an
            event. */
         if (done) {
-            conn->await_notification();
+            conn->await_notification(5*60, 0);
             nrQueueWakeups++;
         } else

Update:

No further cases (Jun 26th - Aug 6th) of queue-runner getting stuck, so this has solved the problem for us.

HackerFoo added a commit to HackerFoo/hydra that referenced this issue Aug 28, 2020
@FlorianFranzen

The nixos.wiki points to this commit as a possible fix; it is part of #597 and has been reported by some users to work.

I would love to see this fixed, so would this commit be a generally accepted way to fix this issue?

t184256 added a commit to t184256/hydra that referenced this issue Apr 29, 2022
blurgyy added a commit to blurgyy/flames that referenced this issue Aug 31, 2022
A build step is performed but never stops. On the hydra host machine I found a
process `nix-store --builders --serve --write` that waits indefinitely;
`strace` shows the process is reading from a pipe but never gets any data. On
the pipe's write end there is another process that polls for something (I did
not look into this) and reads from a socket, and `lsof` shows that the
socket's read and write ends both belong to that second process. After
reverting to nix 2.10.3 the hanging problem is gone.

Possibly related:
  - <NixOS/hydra#366>
  - <NixOS/nix#2560>
  - <NixOS/nix#2260>

Revert "chore: bump flake inputs"

This reverts commit 1d75b21.

Signed-off-by: Gaoyang Zhang <gy@blurgy.xyz>
@ivan
Member

ivan commented Apr 3, 2023

I applied the patch in #366 (comment) five months ago, and since then I haven't needed to restart Hydra anywhere near as often.
