-
Notifications
You must be signed in to change notification settings - Fork 93
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle task "late" events #2597
Conversation
9c338a6
to
1bbf8ab
Compare
I can see how your use of |
Should be easy enough to implement if I understand this correctly.
|
No, what I meant was easier than that (if I understand you correctly!): Every waiting task proxy should emit a late event (once) if the wall-clock time is found to be greater than its clock-trigger time, whether or not it has other unmet prerequisites. This implies, I think, we need some threshold for lateness, since even in a caught-up suite clock-triggered tasks will likely do their final clock check fractionally late. The threshold could be configurable or based on the current main loop interval? Future instances of the same task that may also be late already cannot emit a late event yet because their task proxies don't exist yet to check the time, but I don't think that matters - late events will come from the most recent instance of a task that could possibly run at the moment given the progress of the suite to date. And if one task instance is really late, you might expect it's successor to be late too, and so on (depending on context). This probably implies each individual task proxy has to remember if it has emitted a late event yet (and that should persist else we'll get repeat late events after a restart). |
OK. I think I get what you mean. |
1bbf8ab
to
7365210
Compare
7365210
to
c74a767
Compare
Maybe we still haven't understood each other on what exactly these late events should be, but this doesn't work as I expected. You global prev_late_time check results in only a single late event being emitted for all late tasks. Taking the simplest possible clock-triggered suite as an example:
What I'd expect is, tasks foo.2015 through foo.2018 should each emit a late event shortly after they each come into existence, then foo.2019 onward will submit on time with no further late events emitted. But on this branch, only foo.2015 emits a late event. |
That's not the intention, but you are right about the |
c74a767
to
1a04cb8
Compare
Latest implementation should do it! |
That's got it. Thanks. |
@matthewrmshin - I've made an attempt at documenting this feature, which I think is not going to be easy to explain to users... I'll post a PR to your branch later tonight. |
I have one general comment (I'm starting to look at the code & will submit a review this afternoon). This is currently being implemented in a 'one-size-fits-all' way, but given different users (ops vs. research vs. non-meteorological etc) will surely naturally to treat 'late' events with different levels of urgency would it be worth the effort to include a small level of customisability here? I feel it would be useful, & easy enough, to make the 'threshold' time past the clock trigger at which the event is handled (that is currently set to 1 min, a sensible default value I think) changeable via a late handler 'option' or similar to some preferred time (to include zero), & also to override the logging level to 'INFO' as for some always logging as a 'WARNING' could get irritating &/or clog up their log viewer so that contextually more important warnings become hard to spot or missed. If you think this generally a good idea, we could also consider implementing the means to disable the characteristic discussed here & in the original issue whereby tasks with clock triggers subsequent to suite start-up (or restart) time emit late events. It does seem to me that last year when it was originally discussed everyone could at least see the merits of 'historical' tasks not emitting, though we now want them to; if there could be a significant divide in preference for this why not effectively provide a choice? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor comments, but in general this works as discussed & the tests ensure so in a valid way. I really like how you implemented the solution to the issue at hand by the way, especially the changes to the functions in task_proxy.py
! Please read my comment above, but it is just a suggestion so merge as is if you like (approving).
lib/cylc/scheduler.py
Outdated
itask.state.status in TASK_STATUSES_NEVER_ACTIVE and | ||
this_late_time > itask.clock_trigger_time): | ||
itask.is_late = True | ||
msg = '%s (clock-trigger-time=%s)' % ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A space at the start of the msg
string to output WARNING - [half_past.20180320T0700Z] - late (clock-trigger-time= ...
instead of WARNING - [half_past.20180320T0700Z] -late (clock-trigger-time= ...
etc is clearer (or at least nicer) formatting I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's true, but we'll need to change many other statements as well, for consistency, so I think it should be done separately. Alternatively, we should probably remove the hyphen before the message. It does not serve any purpose, but this may break things that rely on a specific format of the log.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough!
lib/cylc/task_state.py
Outdated
TASK_STATUSES_NEVER_ACTIVE = set([ | ||
TASK_STATUS_RUNAHEAD, | ||
TASK_STATUS_WAITING, | ||
TASK_STATUS_HELD, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why has TASK_STATUS_HELD
been included here? Otherwise, could we not formulate this set in terms of the existing (intuitively related) sets with the equivalent TASK_STATUSES_ALL - TASK_STATUSES_WITH_JOB_SCRIPT - TASK_STATUSES_FINAL
? If anything to avoid adding more explicit sets to an already considerable list - though it depends on if, and if so how often, these sets gets changed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The held
state is almost always imposed manually, so their lateness should not be reported until they are released.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point that is not entirely obvious - can you add a comment for future reference?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But surely then the TASK_STATUS_HELD
entry should be omitted from this set, as the late_tasks_check()
function tests whether a task has a state belonging to the set (scheduler.py
current line 1214) as part of the conditional for establishing 'lateness'? I assume from the UG that releasing a 'held' task will always revert the state to some non-active one i.e. one already in this set, so to not report until release would require its exclusion. Not sure what I have missed here!
lib/cylc/suite_db_mgr.py
Outdated
{"key": "final_point", "value": str(final_point)}, | ||
{"key": "run_mode", "value": schd.run_mode}, | ||
{"key": "cylc_version", "value": CYLC_VERSION}, | ||
{"key": "UTC_mode", "value": cylc.flags.utc}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For purposes of my own learning (this code is from a recent PR of my own) why is it better to store e.g. the UTC mode to the database as a zero/one equivalent instead of a Boolean value? In this case UTC_mode
is natively i.e. as defined in the suite.rc
file of Boolean type, hence why I thought it best to convert to format it as such using str()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SQLite does not have a native boolean type. In such case, it is normally best to represent boolean data using the integers 0|1
, which can be used in boolean context without any conversion. (The expression bool('False')
evaluates to True
in Python, whereas bool(0)
evaluates to False
.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for explaining :)
lib/cylc/task_proxy.py
Outdated
.clock_trigger_time (float): | ||
Clock trigger time in seconds since epoch. | ||
.has_spawned (boolean): | ||
Has this task spawned its successor in the sequence. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider changing .
--> ?
for consistency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Revoking my approval pending outcome of forthcoming discussions!)
@sadielbartholomew - I'll just weigh in on your suggestions above.
|
problem! @cylc/core - if you read the additions to
... therefore, we cannot get useful late alerting out of this new feature as implemented. (apologies @matthewrmshin for failing to realise this before you did the implementation! - however, most of your code will work for what follows...) |
solution? What operators really want (I believe) is to be able to say, "in our environment task X normally triggers at time T, and if it has not done so by T+n alert us that something might be wrong". This is how they use external systems like Nagios to monitor suite progress. The problem for Cylc is, the suite itself does not entirely dictate the value of time T (except in the case of clock-triggered tasks - but see above). In principle we could try to predict T by traversing the graph since the last clock-triggered task and adding up execution time limits (if defined) - but that would miss any external constraints on suite progress (waiting in PBS queues...). Ultimately we could get Cylc to remember the actual wall-clock trigger time of all tasks (relative to their cycle points) and do late alerting based on deviations from the mean - under the assumption that operational systems are reasonably consistent in this respect (which should be true for the big forecast tasks at least). For now though, I propose we simply convert this PR to emit late events based on a new task attribute Thoughts? |
Should be easy enough to implement. |
Thanks for your response to my questions, @hjoliver, & sorry for not picking up on the fact that you were had reservations about the lack of use of the late event in the Absolutely agree that 'configurability that's not really needed just adds unnecessary complexity' & these new points regarding usefulness of the current idea are very significant so glad you brought them up. I have little knowledge of practical operational system management but your solution seems very sensible. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
New considerations to address
That's alright - I didn't really pick up on the implications of what I was saying myself, until the next morning, after I'd slept on it! |
a8224a9
to
ac27593
Compare
Lateness is no longer related to clock trigger. A |
I'll add something to the CUG when we can finally agree with the approach. |
lib/cylc/scheduler.py
Outdated
@@ -1227,6 +1207,22 @@ def database_health_check(self): | |||
# Something has to be very wrong here, so stop the suite | |||
raise SchedulerError(str(exc)) | |||
|
|||
def late_tasks_check(self): | |||
"""Report tasks that are late for their clock triggers.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doc-string is now wrong (latest no longer linked to clock triggers).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This still needs to be updated in line with @hjoliver's previous comment, I believe.
I agree with this approach (not surprisingly perhaps!), so we're good to go if you and your team agree. We should keep much of the cug documentation already added, to explain why lateness can't easily be determined automatically. ... I'll have a go at quickly reformulating the docs now and put up a PR to your branch. Also I'm happy with the term "late offset" because unlike the timeouts it is an offset from cycle point rather than an interval after some suite event. |
I also agree with the approach & with 'late offset' as the descriptor (seems intuitive). Let me know when this is ready for re-review. |
6d139f3
to
86bbca7
Compare
Add event settings to a task to report itself as *late" with respect to its (date-time) cycle point.
86bbca7
to
fb800e8
Compare
Branch re-based. Conflicts resolved. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems to be good to go now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Starting re-review In review, may be a while to complete due to other commitments.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great: works logically & functionally as per @hjoliver's specification & local test pass. I do have some minor comments but nothing regarding code functionality. Merge at will once addressed or otherwise.
lib/cylc/scheduler.py
Outdated
@@ -1227,6 +1207,22 @@ def database_health_check(self): | |||
# Something has to be very wrong here, so stop the suite | |||
raise SchedulerError(str(exc)) | |||
|
|||
def late_tasks_check(self): | |||
"""Report tasks that are late for their clock triggers.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This still needs to be updated in line with @hjoliver's previous comment, I believe.
now = time() | ||
for itask in self.pool.get_tasks(): | ||
# External trigger matching and task expiry must be done | ||
# regardless, so they need to be in separate "if ..." blocks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good comment! Nicely anticipated since I was just about to ask about this.
doc/src/cylc-user-guide/suiterc.tex
Outdated
@@ -1680,6 +1680,7 @@ \subsection{[runtime]} | |||
\item {\bf warning} - the task reported a WARNING severity message | |||
\item {\bf critical} - the task reported a CRITICAL severity message | |||
\item {\bf custom} - the task reported a CUSTOM severity message | |||
\item {\bf late} - the task has been delayed relative to its clock-trigger time |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this description also be updated to reflect the fact that lateness is now unrelated to clock triggers?
doc/src/cylc-user-guide/cug.tex
Outdated
@@ -6979,6 +6979,7 @@ \subsection{Task Event Handling} | |||
\item `warning' - the task reported a WARNING severity message | |||
\item `critical' - the task reported a CRITICAL severity message | |||
\item `custom' - the task reported a CUSTOM severity message | |||
\item `late' - the task has been delayed relative to its clock-trigger time |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this description also be updated to reflect the fact that lateness is now unrelated to clock triggers?
__slots__ = [ | ||
'cleanup_cutoff', | ||
'clock_trigger_time', | ||
'expire_time', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'expire_time'
is no longer an attribute of this class - apparently it wasn't before this PR, but since the attributes are now listed with descriptions we should clear this up. Line 238 can simply be removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it was missed in the description. The attribute is still used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Attribute now added to doc string.
lib/cylc/scheduler.py
Outdated
self.task_events_mgr.pflag = True | ||
elif key == 'warm_point': | ||
self._cli_start_point_string = value | ||
elif key in ["start_point", "warm_point"]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"warm_point"
doesn't seem to be used anywhere now as far as I can tell, so we can simply test for equality with "start_point"
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For back compat. Comment added.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, of course! I need to improve at considering backward compatibility as a possible reason for seeing code which does not have an obvious rationale. Though good idea for a comment.
All comments addressed. |
Log and emit a late event when a task reaches its clock trigger time
but has unmet prerequisites.(In this implementation, it will only emit the event as the suite runs, so tasks with clock-trigger before the time of current suite start up (or restart) will not get reported. Happy to improve on this behaviour.)In the latest implementation, the suite will log and emit a late event once for each clock trigger task that is late for more than a minute. This should work correctly on start up and restart.In the latest implementation, the suite will log and emit a late event once for each task that has
late offset
configured and that the current time is beyond its cycle point time + the late offset. This should work correctly on start up and restart.Close #2207. Supersede #2594.