hydra-queue-runner gets stuck while there are items in the queue #366

Open
domenkozar opened this issue Jul 25, 2016 · 19 comments

@domenkozar
Member

(gdb) bt
#0  0x00007f31a596940d in poll () from /nix/store/gwl3ppqj4i730nhd4f50ncl5jc4n97ks-glibc-2.23/lib/libc.so.6
#1  0x00007f31a64db2b7 in (anonymous namespace)::wait_fd(int, bool, timeval*) () from /nix/store/zclq81axk9yg3rv4xlimq7vsqik5a2aw-libpqxx-4.0.1/lib/libpqxx-4.0.so
#2  0x00007f31a64dc4c1 in pqxx::internal::wait_read(pg_conn const*, long, long) () from /nix/store/zclq81axk9yg3rv4xlimq7vsqik5a2aw-libpqxx-4.0.1/lib/libpqxx-4.0.so
#3  0x00007f31a64e0246 in pqxx::connection_base::await_notification(long, long) () from /nix/store/zclq81axk9yg3rv4xlimq7vsqik5a2aw-libpqxx-4.0.1/lib/libpqxx-4.0.so
#4  0x00000000004196c4 in State::run(unsigned int) ()
#5  0x000000000041a272 in main::{lambda()#1}::operator()() const ()
#6  0x00007f31a7771802 in nix::handleExceptions(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::function<void ()>) ()
   from /nix/store/r3blahkd7z58xsdbgdyl5hbfgaqn0var-nix-1.12pre4523_3b81b26/lib/libnixmain.so
#7  0x000000000040eada in main ()

Happy to provide more information while the server is still stuck, but I'll have to restart it today or tomorrow.

It seems to me it's waiting on PostgreSQL?

cc @edolstra

@domenkozar
Member Author

Ah, it got interrupted after a day:

Jul 25 11:36:53 eiger hydra-queue-runner[30255]: got notification: new builds added to the queue
Jul 25 11:36:53 eiger hydra-queue-runner[30255]: checking the queue for builds > 168416...

I think this doesn't happen often on hydra.nixos.org since new builds arrive all the time, but something is keeping the Hydra queue runner from working through the rest of the queue.

@domenkozar domenkozar added the bug label Jul 25, 2016
@domenkozar
Member Author

Note: I'm running commit dc790c5

@domenkozar
Member Author

Happened again:

#0  0x00007f232e88840d in poll () from /nix/store/gwl3ppqj4i730nhd4f50ncl5jc4n97ks-glibc-2.23/lib/libc.so.6
#1  0x00007f232f3fa2b7 in (anonymous namespace)::wait_fd(int, bool, timeval*) ()
   from /nix/store/zclq81axk9yg3rv4xlimq7vsqik5a2aw-libpqxx-4.0.1/lib/libpqxx-4.0.so
#2  0x00007f232f3fb4c1 in pqxx::internal::wait_read(pg_conn const*, long, long) ()
   from /nix/store/zclq81axk9yg3rv4xlimq7vsqik5a2aw-libpqxx-4.0.1/lib/libpqxx-4.0.so
#3  0x00007f232f3ff246 in pqxx::connection_base::await_notification(long, long) ()
   from /nix/store/zclq81axk9yg3rv4xlimq7vsqik5a2aw-libpqxx-4.0.1/lib/libpqxx-4.0.so
#4  0x00000000004196c4 in State::run(unsigned int) ()
#5  0x000000000041a272 in main::{lambda()#1}::operator()() const ()
#6  0x00007f2330690802 in nix::handleExceptions(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::function<void ()>) () from /nix/store/r3blahkd7z58xsdbgdyl5hbfgaqn0var-nix-1.12pre4523_3b81b26/lib/libnixmain.so
#7  0x000000000040eada in main ()

@domenkozar
Member Author

@edolstra any ideas what's wrong here?

@domenkozar
Member Author

Again:

(gdb) bt
#0  0x00007f43ebccb40d in poll ()
   from target:/nix/store/gwl3ppqj4i730nhd4f50ncl5jc4n97ks-glibc-2.23/lib/libc.so.6
#1  0x00007f43ec83d2b7 in (anonymous namespace)::wait_fd(int, bool, timeval*) ()
   from target:/nix/store/g5h8ip9279jb0y4h3w35rk4qpj7rxq4d-libpqxx-4.0.1/lib/libpqxx-4.0.so
#2  0x00007f43ec83e4c1 in pqxx::internal::wait_read(pg_conn const*, long, long) ()
   from target:/nix/store/g5h8ip9279jb0y4h3w35rk4qpj7rxq4d-libpqxx-4.0.1/lib/libpqxx-4.0.so
#3  0x00007f43ec842246 in pqxx::connection_base::await_notification(long, long) ()
   from target:/nix/store/g5h8ip9279jb0y4h3w35rk4qpj7rxq4d-libpqxx-4.0.1/lib/libpqxx-4.0.so
#4  0x00000000004196c4 in State::run(unsigned int) ()
#5  0x000000000041a272 in main::{lambda()#1}::operator()() const ()
#6  0x00007f43edad3802 in nix::handleExceptions(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::function<void ()>) ()
   from target:/nix/store/7isk02splr10h3a3q9a2nq7grfqymynv-nix-1.12pre4523_3b81b26/lib/libnixmain.so
#7  0x000000000040eada in main ()

@domenkozar
Member Author

Note: I'm using systemctl restart hydra-queue-runner as a workaround to get the queue processing again.

@mbrock

mbrock commented Mar 9, 2018

I also have a habit of restarting the queue runner. There's some lockup almost every other time I submit a new evaluation.

@LisannaAtHome

Also running into this (I have dc5e0b1), and kicking hydra-queue-runner works.

This seems like a critical bug for general hydra usability, and will definitely turn off non-experts. I'm at least documenting this on the wiki since it seems like everyone with a private hydra instance deals with this.

@LisannaAtHome

LisannaAtHome commented Jul 20, 2018

I've noticed this also breaks declarative jobsets. If hydra-queue-runner gets into its stuck state, the list of jobsets will not update without restarting hydra-queue-runner, even if the .jobsets jobset has been evaluated and built recently.

I now have a cron job that checks hydra-queue-runner every minute and restarts it if there are more than zero jobs queued and zero jobs running, or if there are zero jobs queued and zero jobs running (to handle the declarative jobsets case). It works well, but it's tricky to implement because you have to query the Hydra JSON API to get the number of running jobs and scrape an HTML page to get the number of queued jobs. You don't want to restart hydra-queue-runner while long-running jobs are in flight, because they would be canceled and have to start over from the beginning. A rough sketch of the watchdog is below.
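
A minimal Python sketch of such a watchdog, for reference. The base URL, endpoint paths, HTML regex, and script path are placeholders I made up, not Hydra's documented API, so adapt them to whatever your instance actually exposes; note that the two conditions above collapse into "restart whenever nothing is running".

#!/usr/bin/env python3
# Watchdog sketch for the cron-based workaround described above.
# NOTE: the base URL, endpoint paths and HTML regex below are hypothetical
# placeholders, not Hydra's documented API; adjust them before use.
import json
import re
import subprocess
import urllib.request

HYDRA_URL = "http://localhost:3000"   # placeholder: local Hydra web interface

def fetch(path, accept):
    req = urllib.request.Request(HYDRA_URL + path, headers={"Accept": accept})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.read().decode()

def jobs_running():
    # Placeholder JSON endpoint listing builds currently in progress.
    return len(json.loads(fetch("/api/running-builds", "application/json")))

def jobs_queued():
    # Placeholder HTML page scraped for the queued-build count.
    html = fetch("/queue-summary", "text/html")
    m = re.search(r"(\d+)\s+builds?\s+in\s+the\s+queue", html)
    return int(m.group(1)) if m else 0

def main():
    running = jobs_running()
    queued = jobs_queued()
    # Restart only when nothing is running, so in-flight builds are never
    # canceled. (queued > 0 catches a stalled queue; queued == 0 catches the
    # idle declarative-jobsets case, so together they mean: running == 0.)
    if running == 0:
        print(f"queued={queued} running=0, restarting hydra-queue-runner")
        subprocess.run(["systemctl", "restart", "hydra-queue-runner"], check=True)

if __name__ == "__main__":
    main()

Run it from root's crontab every minute, e.g. * * * * * /usr/local/bin/hydra-queue-watchdog.py (the path is just an example).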

@LisannaAtHome

Duplicate issue #550

Applying 73ca325 seems to have fixed this issue for me

@tbenst

tbenst commented Dec 23, 2019

This just happened to me on a new hydra deploy. The first three builds worked fine. I canceled the fourth build partway through. Now the fifth build is queued and won't run. systemctl restart hydra-queue-runner did not fix it, unfortunately.

@tbenst

tbenst commented Dec 23, 2019

OK, in my case the problem seems to result from interrupting a build before it finished. I interrupted a job during the build of cuda_10.1, and any time I restart, that step seems to be what causes the hang:

> journalctl -u hydra-queue-runner.service
Dec 23 18:11:47 hydra-server hydra-queue-runner[15353]: queue monitor: std::bad_alloc
Dec 23 18:11:57 hydra-server hydra-queue-runner[15353]: checking the queue for builds > 0...
Dec 23 18:11:57 hydra-server hydra-queue-runner[15353]: loading build 5 (tyler:babelfish:babelfish)
Dec 23 18:11:57 hydra-server hydra-queue-runner[15353]: copying output ‘/nix/store/ip6149gj1ip9l8yzb2lizyvsr6f78r0a-cuda_10.1.243_418.87.00_linux.run’ of ‘/nix/store/zgzivax9qf558rw4wkrcmjfbd09gpmpc-cuda_10.1.243_418.87.00_linux.run.drv’ from local store
Dec 23 18:11:57 hydra-server hydra-queue-runner[15353]: copying 1 paths...
Dec 23 18:11:57 hydra-server hydra-queue-runner[15353]: copying path '/nix/store/ip6149gj1ip9l8yzb2lizyvsr6f78r0a-cuda_10.1.243_418.87.00_linux.run' to 'file:///var/lib/hydra/cache'...

ip6149gj1ip9l8yzb2lizyvsr6f78r0a-cuda_10.1.243_418.87.00_linux.run is never copied to /var/lib/hydra/cache/*. I tried nix-collect-garbage, but the .run file remains, and I cannot rm the file because of the read-only filesystem. Any ideas how to fix this broken state? I'm a bit surprised that I could bork the system by cancelling a build.

Also, let me know if I should make a new issue, as I'm no longer sure if this is the same as @domenkozar's original post.

Edit: I just realized that the .run file is the URL for src = fetchurl ..., so it makes sense that it's in the Nix store. It's a 2.4 GB file, but I'm not sure why the copying path command never finishes.

@tomberek
Contributor

tomberek commented Jan 2, 2020

Not sure if it's related, but I've noticed large derivations (several GiB) take a long time to copy, either to the local store at the conclusion of a build (like a fetch) or to a remote store. My rough guess is that there is some quadratic performance characteristic.

If that is the case, it usually sits for a while with no output and a single nix thread spinning at 100%. Perhaps this happened and a 15-minute timeout was reached?

@tbenst

tbenst commented Jan 2, 2020

@tomberek interesting, thanks! That helped me fix it. It looks like the queue monitor: std::bad_alloc consistently happens 1-3 minutes after copying path [...].run begins. I bumped up to a t2.large (8 GB RAM) and got the bad_alloc error after 17 minutes. Now, bumped to a t2.xlarge (16 GB RAM), the copy command finally finished:

Jan 02 20:33:22 hydra-server hydra-queue-runner[1120]: copying path '/nix/store/ip6149gj1ip9l8yzb2lizyvsr6f78r0a-cuda_10.1.243_418.87.00_linux.run' (2572375416 bytes, compressed 1.4% in 1168394 ms) to binary cache

19 minutes and 16 GB of RAM is crazy for copying a 2.4 GB file, but hey, it works now. Is this performance problem a Nix or a Hydra problem?

Ironically, it's faster and more memory-efficient to grab the .run file from the remote URL than from the store! From the remote, I could build on a box with only 4 GB of RAM.

Edit: my point about time may be unfair; it's possible this was slow due to EBS throttling by AWS. I think the memory question is still valid, though, and it'd be good to know where I should file the issue.

@tomberek
Contributor

tomberek commented Jan 4, 2020

This is a Nix problem, but I've been unable to discover more details.

@tomberek
Contributor

tomberek commented Jan 4, 2020

Upon more investigation, it seems this may be due to the compression. This is normally hidden by other costs and parallelized by having smaller derivations/folders/files. Large files make this more apparent. Using s3://bucket?parallel-compression=true can alleviate the situation. I'm looking hopefully at this: NixOS/nix#2276
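
For reference, a hydra.conf sketch along those lines; the bucket name is a placeholder, and it assumes your binary cache is already an S3 store:

# hydra.conf sketch (hypothetical bucket name): upload to an S3 binary cache
# with multi-threaded compression enabled.
store_uri = s3://my-hydra-cache?parallel-compression=true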

@xbreak

xbreak commented Jun 26, 2020

We had frequent problems with hydra-queue-runner not processing the build queue.
But a while ago I added a 5-minute maximum wait to await_notification and have not experienced the issue since *knocks on wood*.

I've seen at least one case where there was work pending but the notification did not trigger; I know this from logging some extra info.
Yes, OK, it is possible, but unlikely, that the notification was scheduled to fire an instant after await_notification timed out.

So maybe it makes sense to add a configuration option that lets the user set the maximum wait time? That seems better than restarting the queue runner from a cron job.

diff --git a/src/hydra-queue-runner/queue-monitor.cc b/src/hydra-queue-runner/queue-monitor.cc
--- a/src/hydra-queue-runner/queue-monitor.cc
+++ b/src/hydra-queue-runner/queue-monitor.cc
@@ -42,27 +42,37 @@ void State::queueMonitorLoop()

         /* Sleep until we get notification from the database about an
            event. */
         if (done) {
-            conn->await_notification();
+            conn->await_notification(5*60, 0);
             nrQueueWakeups++;
         } else

Update:

No further cases (Jun 26th - Aug 6th) of queue-runner getting stuck, so this has solved the problem for us.

HackerFoo added a commit to HackerFoo/hydra that referenced this issue Aug 28, 2020
@FlorianFranzen

The nixos.wiki points to this commit as a possible fix; it is part of #597 and has been reported by some users to work.

I would love to see this fixed, so would this commit be a generally accepted way to fix this issue?

t184256 added a commit to t184256/hydra that referenced this issue Apr 29, 2022
blurgyy added a commit to blurgyy/flames that referenced this issue Aug 31, 2022
A build step is performed but never stops. On the hydra host machine I found a
process `nix-store --builders --serve --write` that waits indefinitely;
`strace` shows the process is reading from a pipe but never gets any data. On
the pipe's write end there is another process that polls for something (I did
not look into this) and reads from a socket, and `lsof` shows that the
socket's read and write ends both belong to that second process. After
reverting to nix 2.10.3 the hanging problem is gone.

Possibly related:
  - <NixOS/hydra#366>
  - <NixOS/nix#2560>
  - <NixOS/nix#2260>

Revert "chore: bump flake inputs"

This reverts commit 1d75b21.

Signed-off-by: Gaoyang Zhang <gy@blurgy.xyz>
@ivan
Member

ivan commented Apr 3, 2023

I applied the patch in #366 (comment) five months ago, and since then I haven't needed to restart Hydra anywhere near as often.
