Queue monitor skips builds sometimes: invalid assumption re:incrementing build id? #496

Closed
dtzWill opened this issue Jul 21, 2017 · 3 comments


@dtzWill
Member

dtzWill commented Jul 21, 2017

After months of on-and-off debugging why my Hydra instance periodically decides to ignore jobs in the queue, I think I have something (cc #433).
Jobs are processed properly if/when hydra-queue-runner is restarted or a build is restarted (suggested explanation).

I believe this is due to an incorrect assumption about transaction commit order and the auto-incrementing values used for the build id (serial in PostgreSQL).

Hydra's queue monitor maintains a lastBuildId value that essentially tracks the largest ID of all processed builds. This is used to greatly reduce the burden of checking the build table by only looking for builds that have an identifier larger than this value (relying on the auto-increment property of row insertion on the id column).

Builds are added by hydra-eval-jobset which is executed concurrently.

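Unfortunately, it appears that because identifier generation is /not/ tied to transaction commit (intentionally), it's possible for the queue monitor to see builds whose identifiers are greater than those of builds that have not yet been committed. Once lastBuildId has advanced past those lower identifiers, the builds that commit later are never picked up.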

(I am not a database expert, but googling around on the subject turns up numerous articles describing the intricacies of this behavior.)
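
To make the suspected race concrete, here is a small in-memory sketch. It is not Hydra code: `BuildTable`, `QueueMonitor`, and the job names are made up for illustration, and the only database behavior it assumes is that an ID is consumed when the row is inserted while the row becomes visible only at commit.

```python
import itertools


class BuildTable:
    """Toy stand-in for the builds table (hypothetical, not Hydra's schema).

    IDs are handed out at insert time; rows become visible only at commit.
    """

    def __init__(self):
        self._next_id = itertools.count(1)
        self.committed = {}  # build id -> job name

    def begin_insert(self, job):
        # Like a serial column: the ID is consumed now, before any commit.
        return (next(self._next_id), job)

    def commit(self, pending):
        build_id, job = pending
        self.committed[build_id] = job


class QueueMonitor:
    """Toy monitor that only looks at IDs above last_build_id."""

    def __init__(self):
        self.last_build_id = 0
        self.seen = []

    def poll(self, table):
        new_ids = sorted(i for i in table.committed if i > self.last_build_id)
        for i in new_ids:
            self.seen.append(i)
            self.last_build_id = i
        return new_ids


table = BuildTable()
monitor = QueueMonitor()

a = table.begin_insert("job-a")    # gets id 1, transaction still open
b = table.begin_insert("job-b")    # gets id 2
table.commit(b)                    # id 2 commits first
first_poll = monitor.poll(table)   # monitor sees id 2 and remembers it
table.commit(a)                    # id 1 commits late
second_poll = monitor.poll(table)  # nothing above 2, so id 1 is skipped

missed = sorted(set(table.committed) - set(monitor.seen))
print(first_poll, second_poll, missed)
```

In this toy run the build with id 1 is committed but never seen by the monitor, which matches the symptom of jobs sitting in the queue until something forces a reload.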


I'm not quite sure how to best fix this, unfortunately, but am hoping once the issue is agreed to be real (and my diagnosis passes your peer review) a solution can be worked towards :).

The "dumb" solution I'm trying to improve upon (surely it's possible, right?) is to simply re-scan the entire queue periodically for missed jobs. It's unsatisfying, but since this is a relatively rare occurrence, this approach would probably do the job well enough in practice.
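
A rough sketch of what that periodic re-scan could look like, again as a toy model rather than Hydra code. The `poll` function, the `rescan_every` parameter, and the use of plain sets of IDs are all assumptions made for illustration.

```python
def poll(committed_ids, seen, last_build_id, poll_count, rescan_every=10):
    """One monitor iteration over a toy set of committed build IDs.

    Normally only IDs above last_build_id are considered. On every
    `rescan_every`-th poll the watermark is ignored and everything is
    scanned, so late-committed builds are eventually found.
    Returns (newly_found_ids, new_last_build_id).
    """
    full_scan = poll_count % rescan_every == 0
    floor = 0 if full_scan else last_build_id
    found = sorted(i for i in committed_ids if i > floor and i not in seen)
    seen.update(found)
    new_last = max([last_build_id, *found])
    return found, new_last


seen = set()
last = 0
found1, last = poll({2}, seen, last, poll_count=1)      # id 2 committed first
found2, last = poll({1, 2}, seen, last, poll_count=2)   # id 1 is missed here
found3, last = poll({1, 2}, seen, last, poll_count=10)  # full scan finds id 1
print(found1, found2, found3, last)
```

The cost is that a missed build can wait up to `rescan_every` polls, and every full scan touches the whole queue, which is exactly the work lastBuildId was meant to avoid.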

Other solutions involve aspects of table locking and database schema design that frankly I think others are probably much more qualified to discuss :).

Thanks for your time and reading this issue ❤️

@edolstra
Member

@dtzWill Great catch! I've fixed it by having hydra-eval-jobset tell hydra-queue-runner about the lowest build ID it just added so it can load the queue from there.
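
A minimal sketch of that idea, as a toy model only: the `QueueMonitor` class and the `on_builds_added` handler are hypothetical names, and how the two programs actually communicate is not shown here.

```python
class QueueMonitor:
    """Toy monitor that can be told where new builds start (hypothetical)."""

    def __init__(self):
        self.last_build_id = 0
        self.seen = set()

    def on_builds_added(self, lowest_new_id):
        # Rewind the watermark so the next poll starts at the lowest ID
        # the evaluator says it has just added.
        self.last_build_id = min(self.last_build_id, lowest_new_id - 1)

    def poll(self, committed_ids):
        found = sorted(
            i for i in committed_ids
            if i > self.last_build_id and i not in self.seen
        )
        self.seen.update(found)
        if self.seen:
            # Restore the watermark to the highest ID processed so far.
            self.last_build_id = max(self.seen)
        return found


monitor = QueueMonitor()
first = monitor.poll({2})       # id 2 committed before id 1
monitor.on_builds_added(1)      # evaluator reports its lowest new ID after commit
second = monitor.poll({1, 2})   # id 1 is now picked up
print(first, second, monitor.last_build_id)
```

Because the notification is sent by the process that did the insert, it can be issued after that process's own commit, so the order in which other transactions commit no longer matters in this model.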

@dtzWill
Member Author

dtzWill commented Jul 21, 2017

Great, thanks for the quick fix! 🥇

@domenkozar
Member

Interesting. I opened #366 before the concurrent evaluator was written, but I'm interested to see if this fixes the issue (I currently run a cronjob that restarts the queue runner).


3 participants