Improve task polling and timeout handling #2593
Conversation
In the long run, I would quite like to separate the concepts of tasks and jobs (e.g. #2241 (comment)). With that, polling should be a job management functionality. I'd quite like a way to manage polling schedules using a centralised poller: a job would propose its polling schedule to the poller, but the poller would decide the best times, so that it can bunch together related polls (e.g. task jobs on the same cluster) in its schedules.
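As a rough illustration of that idea, here is a minimal sketch of such a centralised poller; the `Poller` class, its `propose`/`due` methods, and the slot-based bunching are hypothetical and not part of the existing code:

```python
import time
from collections import defaultdict


class Poller:
    """Hypothetical centralised poller: jobs propose poll times and the
    poller rounds them into shared slots so that polls for jobs on the
    same cluster can be issued together."""

    def __init__(self, slot_seconds=60):
        self.slot_seconds = slot_seconds
        # {slot_time: {cluster: [job_id, ...]}}
        self.schedule = defaultdict(lambda: defaultdict(list))

    def propose(self, job_id, cluster, proposed_time):
        """Accept a job's proposed poll time, rounded to the nearest slot."""
        slot = round(proposed_time / self.slot_seconds) * self.slot_seconds
        self.schedule[slot][cluster].append(job_id)
        return slot

    def due(self, now=None):
        """Yield (cluster, job_ids) groups whose slot time has passed."""
        now = time.time() if now is None else now
        for slot in sorted(s for s in self.schedule if s <= now):
            for cluster, job_ids in self.schedule.pop(slot).items():
                yield cluster, job_ids
```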
(Codacy failure will be fixed by #2567.)
(Travis CI has passed but is reporting in progress here for some reason.)
I've just received a bug report for a production system. The problem probably resulted from doing all post-timeout polls immediately, as described in #2568, instead of waiting on the delay intervals. It would be good to have a test to confirm that polling occurs after the configured intervals.
@hjoliver Yes, I am working on turning a manual test into an automated one. Not quite there yet.
Great, thanks!
Activity of one task
This looks pretty good, but I wonder if we could have:
We should also mention `execution time limit polling intervals` under `execution time limit` in the CUG global.rc reference.
Modified the logging logic to do:
Example:
(Rebased + conflict resolved. New test moved from
Damn, I forgot about this bug! To my mind, this one is worse than the other bugs fixed for the just-released 7.6.1. @matthewrmshin and @oliver-sanders - we should get this reviewed and merged quickly, if possible, and bang out 7.6.2.
#2606 should make this one less lethal, but yes, I agree this fix should go in as soon as possible.
lib/cylc/task_pool.py
Outdated
key1, submit_num = ctx_key
key = (key1, cycle, name, submit_num)
self.task_events_mgr.event_timers[key] = TaskActionTimer(
if ctx_key == "poll_timer":
ctx_key = "?"
if ctx_key == "poll_timer":
Should now be fixed.
(Adding myself as reviewer upon @matthewrmshin's request in person.)
Still considering the core logical intention and code changes, but in the meantime I noticed that some of the tests could, in my opinion, be rewritten in a more explicit and adaptable way; see the side PR referenced.
Branch re-based. Conflicts resolved.
Branch re-based. Conflicts resolved.
[runtime]
    [[foo]]
        script = """
            #rm "$CYLC_TASK_LOG_ROOT.status"  # Disable polling.
The `#rm` line seems to have been unintentionally left in here from the development/debugging stage.
I have now removed the `#rm` command.
Sensible change which works as described and fixes the associated bug. It passes my local test battery and the new tests are suitable. A few minor comments which might be noteworthy (sorry, I meant to include the previous comment with these but pressed the wrong confirm button).
batch_sys_conf = {}
time_limit_delays = batch_sys_conf.get(
    'execution time limit polling intervals', [60, 120, 420])
timeout = time_limit + sum(time_limit_delays)
The exception catch on line 908 will result in `time_limit_delays` getting set to `None`, which will throw a `TypeError` in the `sum()` call, I believe.
No, if `batch_sys_conf` is an empty dict, `time_limit_delays` will be set to the default value `[60, 120, 420]` via the second argument of the `.get` method.
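For anyone following along, the behaviour described can be checked in isolation (plain Python, not the cylc code itself):

```python
batch_sys_conf = {}  # empty dict, the case discussed above
time_limit_delays = batch_sys_conf.get(
    'execution time limit polling intervals', [60, 120, 420])
print(time_limit_delays)       # [60, 120, 420] -- the default, not None
print(sum(time_limit_delays))  # 600, so the later sum() call is safe
```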
Ah yes, my mistake!
delays.extend([delays[-1]] * size)
time_limit_delays[0] += time_limit - sum(delays)
delays += time_limit_delays
else:  # if itask.state.status == TASK_STATUS_SUBMITTED:
Consider adding an 'i.e.' or similar to this comment: I went to check the nature of the `TASK_STATUSES_ACTIVE` set to work out whether the comment was meant to be there, thinking it might, for example, have been left in from the development stage.
The comment here is deliberate. It is saying that the `else:` statement:
else:  # if itask.state.status == TASK_STATUS_SUBMITTED:
is equivalent to:
elif itask.state.status == TASK_STATUS_SUBMITTED:
The bare `else:` is faster because it does not have to evaluate `itask.state.status == TASK_STATUS_SUBMITTED`.
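To spell the idiom out in a standalone form (the status constants below are stand-ins, not the real cylc ones): at this point the status can only be one of two values, so the bare `else:` is safe and skips one comparison:

```python
TASK_STATUS_RUNNING = 'running'
TASK_STATUS_SUBMITTED = 'submitted'

def choose_delays(status, running_delays, submitted_delays):
    # Callers only reach this with a running or submitted task, so the
    # bare else is equivalent to an explicit elif on SUBMITTED.
    if status == TASK_STATUS_RUNNING:
        return running_delays
    else:  # if status == TASK_STATUS_SUBMITTED:
        return submitted_delays

print(choose_delays(TASK_STATUS_SUBMITTED, [60], [15]))  # [15]
```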
I understood this; my point is that it wasn't immediately obvious to me that the comment was implying equivalence (hence the suggestion of 'i.e.'), as opposed to being something left there during development. Though it could easily just be me that it was not at first evident to! Sorry if I was not clear.
I see your point @sadielbartholomew, but I think we use this idiom fairly often (possibly omitting the "if" in the comment though).
Apologies, it is clearly confusion arising from my lack of experience - when I see similar comments I will now know distinctly what they mean.
No apologies necessary! (And I wouldn't be too surprised if you see a few things we do that aren't clear or correct.)
lib/cylc/task_events_mgr.py
Outdated
# Set timeout
timeref = None  # reference time, submitted or started time
timeout = None  # timeout in setting
delays = []  # polling intervals
As far as I can tell it is unnecessary to initialise `delays` as a list, given that the `get_host_conf()` method will always be called with `default=[900]` specified, which will get returned from that method if all else 'fails'.
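A sketch of the fallback pattern being described; `get_host_conf` here is a hypothetical stand-in with an invented signature, used only to show why the earlier `delays = []` initialisation is redundant:

```python
def get_host_conf(conf, key, default=None):
    # Hypothetical stand-in: return the configured value for `key`,
    # or `default` if no value is set anywhere.
    value = conf.get(key)
    return value if value else default

# Because the real call site always passes default=[900], `delays`
# always ends up as a non-empty list, so pre-initialising it to []
# has no effect.
delays = get_host_conf({}, 'submission polling intervals', default=[900])
print(delays)  # [900]
```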
message = 'health check settings: %s=%s' % (timeout_key, timeout_str)
# Attempt to group identical consecutive delays as N*DELAY,...
if itask.poll_timer.delays:
    items = []  # [(number of item - 1, item), ...]
Lines 940-951 are nice! A clever way to process and report the delays.
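For reference, the kind of grouping being praised here can be sketched with `itertools.groupby` (a standalone sketch, not the actual code in the PR):

```python
from itertools import groupby

def format_delays(delays):
    """Group identical consecutive delays as N*DELAY for reporting."""
    parts = []
    for delay, group in groupby(delays):
        count = len(list(group))
        parts.append('%d*%s' % (count, delay) if count > 1 else str(delay))
    return ','.join(parts)

print(format_delays([30, 30, 30, 60, 120]))  # 3*30,60,120
```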
Branch squashed and re-based. Conflicts resolved.
(I've started re-reviewing this) (note new conflicts).
One variable to hold timeout and one variable to hold a poll timer. Populate timeout time and poll timer only when task is in appropriate state. Reset timeout variable after use. Combine execution polling schedule with execution time limit polling schedule.
Info on submission/execution timeout. Info on polling schedule. Info on issuing a poll, and estimated delay for next one.
Add link to `execution time limit polling intervals`.
(mostly reviewed, I'm intending to finish it this evening)
No problems found, all good.
One variable to hold timeout and one variable to hold a poll timer. Populate
timeout time and poll timer only when task is in appropriate state. Reset
timeout variable after use. Combine execution polling schedule with execution
time limit polling schedule.
Not quite a full rewrite, but should be sufficient to fix #2568.
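To make the schedule combination concrete, here is a worked example following the quoted `task_events_mgr` snippet; the numbers are illustrative, and the way `size` is computed is an assumption, since the quoted diff only shows how it is used:

```python
# Execution polling intervals, a time limit, and the time-limit
# polling intervals (illustrative values, not cylc defaults).
delays = [120, 120, 120]            # polls at T+120, T+240, T+360
time_limit = 600
time_limit_delays = [60, 120, 420]

# Assumed computation of `size`: how many repeats of the last delay
# are needed for the normal schedule to reach the time limit.
size = int((time_limit - sum(delays)) / delays[-1])
delays.extend([delays[-1]] * size)  # [120, 120, 120, 120, 120]

# Shift the first time-limit delay so the next poll fires
# `time_limit_delays[0]` seconds after the time limit itself.
time_limit_delays[0] += time_limit - sum(delays)  # 60 + 600 - 600 = 60
delays += time_limit_delays

print(delays)  # [120, 120, 120, 120, 120, 60, 120, 420]
```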