After months of on-and-off debugging to figure out why my Hydra instance periodically decides to ignore jobs in the queue, I think I have something (cc #433).
Jobs are processed properly if/when `hydra-queue-runner` is restarted or a build is restarted (which is what suggested this explanation).
I believe this is due to an incorrect assumption about transaction commit order and the auto-incrementing values used for build IDs (`serial` in PostgreSQL).
Hydra's queue monitor maintains a `lastBuildId` value that tracks the largest ID of all builds it has processed. This greatly reduces the cost of checking the builds table: the monitor only looks for builds whose identifier is larger than this value, relying on the auto-increment property of the `id` column.
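For illustration, here is a toy version of that incremental poll (a minimal sketch; the table and column names are simplified stand-ins, not Hydra's actual schema, and the same toy schema is reused in the sketches below):

```sql
-- Toy stand-in for Hydra's build table (names simplified, not the real schema):
CREATE TABLE builds (
    id       serial  PRIMARY KEY,        -- auto-incrementing build identifier
    finished integer NOT NULL DEFAULT 0  -- 0 = still queued
);

-- The monitor's incremental poll: only builds above the watermark are examined.
SELECT id
  FROM builds
 WHERE finished = 0
   AND id > :last_build_id  -- watermark = largest id seen so far
 ORDER BY id;
-- Afterwards the monitor advances :last_build_id to the largest id returned.
```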
Builds are added by `hydra-eval-jobset`, multiple instances of which run concurrently.
Unfortunately, it appears that because identifier generation is *not* tied to transaction commit (intentionally so), it's possible for the queue monitor to see a committed build whose identifier is greater than that of a build that is not yet committed. When the monitor then advances `lastBuildId` past the uncommitted build's identifier, that build is silently skipped forever.
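A concrete interleaving shows the hazard. In PostgreSQL, `nextval()` on the sequence backing a `serial` column hands out the ID at INSERT time, outside the transaction: it is never rolled back and is not reordered at commit. A hypothetical transcript with two concurrent evaluator sessions (toy schema from above):

```sql
-- Session A (one hydra-eval-jobset run):
BEGIN;
INSERT INTO builds DEFAULT VALUES;  -- nextval() allocates id = 100 immediately
-- ...session A keeps working inside its still-open transaction...

-- Session B (a concurrent run), meanwhile:
BEGIN;
INSERT INTO builds DEFAULT VALUES;  -- nextval() allocates id = 101
COMMIT;                             -- id 101 becomes visible first

-- The queue monitor polls here: it sees only id 101
-- and advances its watermark to lastBuildId = 101.

-- Session A finally commits:
COMMIT;  -- id 100 becomes visible now, but 100 <= lastBuildId,
         -- so the incremental poll never looks at it again
```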
(I am not a database expert, but searching around on the subject turns up numerous articles describing the intricacies of this behavior.)
I'm not quite sure how best to fix this, unfortunately, but I'm hoping that once the issue is agreed to be real (and my diagnosis passes your peer review) a solution can be worked towards :).
The "dumb" solution I'm trying to improve upon (surely it's possible, right?) is to simply periodically re-scan the entire queue for missed jobs. It's unsatisfying but since this is a relatively rare occurrence this approach would probably solve the job well enough in practice.
Other solutions involve aspects of table locking and database schema design that, frankly, I think others are much more qualified to discuss :).
Thanks for your time and for reading this issue ❤️
@dtzWill Great catch! I've fixed it by having `hydra-eval-jobset` tell `hydra-queue-runner` the lowest build ID it just added, so it can load the queue from there.
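Roughly the shape of it, as a sketch in PostgreSQL LISTEN/NOTIFY terms (the channel name and payload format here are simplified for illustration, not necessarily the exact ones used):

```sql
-- hydra-eval-jobset side, after committing a batch of new builds
-- (channel name and payload format are illustrative assumptions):
NOTIFY builds_added, '100';  -- payload: the lowest build id in the batch just committed

-- hydra-queue-runner side:
LISTEN builds_added;
-- On each notification, reload the queue starting from the payload id when it is
-- lower than lastBuildId, so builds committed out of id order are not skipped.
```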
Interesting; I opened #366 before the concurrent evaluator was written, but I'm curious to see whether this fixes the issue (I currently run a cronjob that restarts the queue runner).