After months of on-and-off debugging to figure out why my Hydra instance periodically decides to ignore jobs in the queue, I think I have something (cc #433).
Jobs are processed properly if/when `hydra-queue-runner` is restarted or a build is restarted (which is what suggested this explanation).
I believe this is due to an incorrect assumption about transaction commit order and the auto-incrementing values used for build IDs (`serial` in PostgreSQL).
Hydra's queue monitor maintains a `lastBuildId` value that tracks the largest ID of all builds it has processed. This greatly reduces the cost of checking the builds table: the monitor only looks for builds whose identifier is larger than this value, relying on the auto-increment property of the `id` column.
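For illustration, here is a toy version of that incremental poll (a minimal sketch; the table and column names are simplified stand-ins, not Hydra's actual schema, and the same toy schema is reused in the sketches below):

```sql
-- Toy stand-in for Hydra's build table (names simplified, not the real schema):
CREATE TABLE builds (
    id       serial  PRIMARY KEY,        -- auto-incrementing build identifier
    finished integer NOT NULL DEFAULT 0  -- 0 = still queued
);

-- The monitor's incremental poll: only builds above the watermark are examined.
SELECT id
  FROM builds
 WHERE finished = 0
   AND id > :last_build_id  -- watermark = largest id seen so far
 ORDER BY id;
-- Afterwards the monitor advances :last_build_id to the largest id returned.
```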
Builds are added by `hydra-eval-jobset`, multiple instances of which run concurrently.
Unfortunately, it appears that because identifier generation is *not* tied to transaction commit (intentionally so), it's possible for the queue monitor to see a committed build whose identifier is greater than that of a build that is not yet committed. When the monitor then advances `lastBuildId` past the uncommitted build's identifier, that build is silently skipped forever.
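A concrete interleaving shows the hazard. In PostgreSQL, `nextval()` on the sequence backing a `serial` column hands out the ID at INSERT time, outside the transaction: it is never rolled back and is not reordered at commit. A hypothetical transcript with two concurrent evaluator sessions (toy schema from above):

```sql
-- Session A (one hydra-eval-jobset run):
BEGIN;
INSERT INTO builds DEFAULT VALUES;  -- nextval() allocates id = 100 immediately
-- ...session A keeps working inside its still-open transaction...

-- Session B (a concurrent run), meanwhile:
BEGIN;
INSERT INTO builds DEFAULT VALUES;  -- nextval() allocates id = 101
COMMIT;                             -- id 101 becomes visible first

-- The queue monitor polls here: it sees only id 101
-- and advances its watermark to lastBuildId = 101.

-- Session A finally commits:
COMMIT;  -- id 100 becomes visible now, but 100 <= lastBuildId,
         -- so the incremental poll never looks at it again
```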
(I am not a database expert, but searching around on the subject turns up numerous articles describing the intricacies of this behavior.)
I'm not quite sure how best to fix this, unfortunately, but I'm hoping that once the issue is agreed to be real (and my diagnosis passes your peer review) a solution can be worked towards :).
The "dumb" solution I'm trying to improve upon (surely it's possible, right?) is to simply periodically re-scan the entire queue for missed jobs. It's unsatisfying but since this is a relatively rare occurrence this approach would probably solve the job well enough in practice.
Other solutions involve aspects of table locking and database schema design that, frankly, I think others are much more qualified to discuss :).
Thanks for your time and for reading this issue ❤️
@dtzWill Great catch! I've fixed it by having `hydra-eval-jobset` tell `hydra-queue-runner` the lowest build ID it just added, so it can load the queue from there.
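Roughly the shape of it, as a sketch in PostgreSQL LISTEN/NOTIFY terms (the channel name and payload format here are simplified for illustration, not necessarily the exact ones used):

```sql
-- hydra-eval-jobset side, after committing a batch of new builds
-- (channel name and payload format are illustrative assumptions):
NOTIFY builds_added, '100';  -- payload: the lowest build id in the batch just committed

-- hydra-queue-runner side:
LISTEN builds_added;
-- On each notification, reload the queue starting from the payload id when it is
-- lower than lastBuildId, so builds committed out of id order are not skipped.
```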
Interesting; I opened #366 before the concurrent evaluator was written, but I'm curious to see whether this fixes the issue (I currently run a cronjob that restarts the queue runner).