[not for merge] Status poller rearrangement patch stack #3295

Draft
wants to merge 12 commits into
base: master
from

Conversation

benclifford
Collaborator

@benclifford benclifford commented Mar 26, 2024

This is a patch stack with a bunch of provider status rearrangement.

The primary user-related motivation is to fix issues #3235 and #2627

Another goal is to rationalise the handling of multiple sources of block/job status information.

As of the time of writing, this PR contains many small steps that culminate in fixing (I hope) both of those issues, but leaves the scaling code still needing cosmetic tidyup. It looks like PollItem is now a facade that mostly handles reporting things to the monitoring system, which could also be moved into the status handling executor code...

This patch stack is managed as an stgit patch stack on my (@benclifford) laptop, so don't go pushing things to this branch because that's a hassle for me to deal with.

Because the scaling code is quite hard to comprehensively understand, and other attempts to change this code have proven very hard for everyone to review, I would like to merge this code patch-by-patch, each one its own PR. Each PR should clearly define whether I am expecting any behaviour change and, if so, what that behaviour change is, ideally accompanied by tests. I would like reviewers to pay attention to the behaviour of each PR, rather than assuming it's probably right.

When reviewing each PR individually, this current PR should serve as end-goal context for why a change is being made. Each commit on this branch is a future-PR-in-preparation and should make sense on its own.

@benclifford benclifford force-pushed the benc-status-refactor branch 2 times, most recently from 6ecc4a8 to 3518372 Compare March 26, 2024 11:16
@benclifford benclifford changed the title [not for merge] status refactor [not for merge] Status poller rearrangement patch stack Mar 26, 2024
@benclifford benclifford force-pushed the benc-status-refactor branch 2 times, most recently from 15ad3f1 to 0460015 Compare March 28, 2024 16:48
@benclifford benclifford force-pushed the benc-status-refactor branch 4 times, most recently from da14f7a to e38ea68 Compare April 9, 2024 20:07
@benclifford benclifford force-pushed the benc-status-refactor branch from e38ea68 to 9fa50e5 Compare July 16, 2024 14:57
block ID and job ID mappings contain the full historical list of blocks, but prior to this PR, the mapping was used as the source of current jobs that should be scaled in
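
To illustrate the distinction above, here is a minimal sketch, not the code in this patch stack. It assumes a hypothetical `block_to_job` mapping holding every block ever submitted and a hypothetical `status` dict holding the latest known state per block; the `State` enum and `scale_in_candidates` are made-up names for illustration only. Scale-in candidates are drawn from blocks whose state is not terminal, rather than from the historical mapping directly.

```python
from enum import Enum


class State(Enum):
    # Hypothetical block states, for illustration only.
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"


# States in which a block no longer exists as a live job.
TERMINAL = {State.COMPLETED, State.FAILED}


def scale_in_candidates(block_to_job, status):
    """Return the block IDs that are still live and so may be scaled in.

    block_to_job: full historical mapping of block ID -> job ID.
    status: latest known state for each block ID.
    """
    return [
        block_id
        for block_id in block_to_job
        if block_id in status and status[block_id] not in TERMINAL
    ]


# The mapping remembers all three blocks, but block "0" has already finished.
block_to_job = {"0": "job-a", "1": "job-b", "2": "job-c"}
status = {"0": State.COMPLETED, "1": State.RUNNING, "2": State.PENDING}
candidates = scale_in_candidates(block_to_job, status)
```

Using the mapping alone would offer block "0" for scale-in even though it is no longer running.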
…that it exists?

specifically raised by khk in the context of def scale_in
…nits" that only make sense in proportion to other scaling load amounts (i.e. as ratios) - htex uses "tasks" as the unit. wq now uses "cores" as the unit.

variables and text inside strategy.py should explain this. variables and docstrings should be clearer about this.
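
As a sketch of why the unit only matters as a ratio: assuming (this is an assumption, not taken from strategy.py) that the strategy compares outstanding load against capacity measured in the same unit, the absolute unit cancels out. The `load_ratio` helper and the numbers below are hypothetical.

```python
def load_ratio(outstanding, capacity):
    """Ratio of outstanding load to capacity.

    Both arguments must be in the same executor-specific unit
    (for example tasks, or cores); the result is unitless.
    """
    if capacity == 0:
        # No capacity at all: any outstanding load is infinitely oversubscribed.
        return float("inf") if outstanding > 0 else 0.0
    return outstanding / capacity


# An executor that counts tasks: 30 tasks outstanding, 10 task slots.
tasks_ratio = load_ratio(30, 10)

# An executor that counts cores: 48 cores requested, 16 cores available.
cores_ratio = load_ratio(48, 16)
```

Both executors are oversubscribed by the same factor here, even though one counts tasks and the other counts cores, so a strategy working on the ratio does not need to know which unit was used.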
changes: none
TODO: this reveals a possible bug here that FAILED entries in simulated status are not immediately sent, but instead only get sent
at the next poller update? unlike submitted entries, which are sent immediately? that should be an easy fix after this PR, though...
…caling strategy will see submitted jobs immediately, before a provider status refresh happens. this makes the scaling code immediately aware of what just happened, rather than for one poll period acting as if nothing had happened.

When making changes that will later be reflected in the _status table, those changes should also be made immediately and explicitly in the cached _status table.

before this PR, this code path does not happen in the case of a failed submission, where a failure status will appear when a refresh happens, but not before. in that case the scaling code will act as if the failed submission did not happen and will continue to submit repeatedly until a refresh happens much later.  this PR makes _status be updated in this case too.

fixes 3235
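
The behaviour described in this commit message can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the executor code in this PR: `CachedStatusExecutor`, `scale_out_one`, and the state strings are hypothetical names, and `submit` stands in for whatever call launches a block.

```python
class CachedStatusExecutor:
    """Toy executor that keeps a cached status table between refreshes."""

    def __init__(self, submit):
        self._submit = submit  # callable: block_id -> job_id, may raise
        self._status = {}      # cached block_id -> state string

    def scale_out_one(self, block_id):
        """Submit one block and record the outcome in the cache immediately."""
        try:
            job_id = self._submit(block_id)
        except Exception:
            # Record the failure now, instead of waiting for a later refresh,
            # so the scaling code can see that this submission failed.
            self._status[block_id] = "FAILED"
            return None
        # Successful submissions are also visible right away.
        self._status[block_id] = "PENDING"
        return job_id

    def status(self):
        # Return a copy so callers cannot mutate the cache by accident.
        return dict(self._status)


def failing_submit(block_id):
    # Hypothetical provider call that always rejects the submission.
    raise RuntimeError("submission rejected")


def working_submit(block_id):
    # Hypothetical provider call that always succeeds.
    return "job-" + block_id


failed = CachedStatusExecutor(failing_submit)
failed_job = failed.scale_out_one("0")

ok = CachedStatusExecutor(working_submit)
ok_job = ok.scale_out_one("0")
```

With the failure recorded in the cache, a strategy that inspects `status()` before the next refresh sees the failed block rather than an empty table.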
it can be subclassed to add in executor-specific status (as happens with htex)
@benclifford benclifford force-pushed the benc-status-refactor branch from 9fa50e5 to 92b0c46 Compare July 27, 2024 13:38