Handle task "late" events #2597

matthewrmshin · 2018-03-06T16:00:44Z

Log and emit a late event when a task reaches its clock trigger time ~~but has unmet prerequisites~~.

(In this implementation, it will only emit the event as the suite runs, so tasks with clock-trigger before the time of current suite start up (or restart) will not get reported. Happy to improve on this behaviour.)

~~In the latest implementation, the suite will log and emit a late event once for each clock trigger task that is late for more than a minute. This should work correctly on start up and restart.~~

In the latest implementation, the suite will log and emit a late event once for each task that has a late offset configured, as soon as the current time is beyond its cycle point time + the late offset. This should work correctly on start up and restart.

Close #2207. Supersede #2594.

hjoliver · 2018-03-08T22:22:35Z

> (In this implementation, it will only emit the event as the suite runs, so tasks with clock-trigger before the time of current suite start up (or restart) will not get reported. Happy to improve on this behaviour.)

I can see how your use of prev_late_time_check = now is a nice clean way to do this, but IMO tasks should emit a late event even if their clock trigger time is not greater than "now" - such tasks are "even later" than those that can't run at their clock-trigger times once the suite has caught up, after all. (Does that make sense?!)

matthewrmshin · 2018-03-09T10:17:51Z

Should be easy enough to implement if I understand this correctly.

Report late tasks on suite start up?
Remember previous late check time on restart?

hjoliver · 2018-03-12T03:52:45Z

No, what I meant was easier than that (if I understand you correctly!):

Every waiting task proxy should emit a late event (once) if the wall-clock time is found to be greater than its clock-trigger time, whether or not it has other unmet prerequisites.

This implies, I think, we need some threshold for lateness, since even in a caught-up suite clock-triggered tasks will likely do their final clock check fractionally late. The threshold could be configurable or based on the current main loop interval?

Future instances of the same task that may already be late cannot emit a late event yet, because their task proxies don't exist yet to check the time, but I don't think that matters - late events will come from the most recent instance of a task that could possibly run at the moment given the progress of the suite to date. And if one task instance is really late, you might expect its successor to be late too, and so on (depending on context).

This probably implies each individual task proxy has to remember if it has emitted a late event yet (and that should persist else we'll get repeat late events after a restart).
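A minimal sketch of the per-proxy flag described above, to make the "emit once" behaviour concrete. The `TaskProxy` class, the `check_late` helper and the 60-second `threshold` are hypothetical names and values for illustration, not the code on this branch. Persistence is only hinted at by the `is_late` field, which would have to be written to the suite database to survive a restart.

```python
from dataclasses import dataclass


@dataclass
class TaskProxy:
    """Hypothetical stand-in for a waiting task proxy."""
    identity: str
    clock_trigger_time: float  # seconds since epoch
    is_late: bool = False      # would need persisting across restarts


def check_late(tasks, now, threshold=60.0):
    """Return identities of tasks that become late on this pass.

    A task is reported at most once: the flag on the proxy is set the
    first time the wall-clock time exceeds trigger time + threshold.
    """
    emitted = []
    for task in tasks:
        if task.is_late:
            continue  # already reported, do not repeat the event
        if now > task.clock_trigger_time + threshold:
            task.is_late = True
            emitted.append(task.identity)
    return emitted


tasks = [TaskProxy("foo.2015", 0.0), TaskProxy("foo.2019", 1000.0)]
first_pass = check_late(tasks, now=500.0)
second_pass = check_late(tasks, now=500.0)
third_pass = check_late(tasks, now=2000.0)
```

Because the flag lives on each proxy rather than in one global variable, a task spawned later is still checked against its own trigger time.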

matthewrmshin · 2018-03-13T20:56:01Z

OK. I think I get what you mean.

hjoliver · 2018-03-16T00:25:48Z

Maybe we still haven't understood each other on what exactly these late events should be, but this doesn't work as I expected. Your global prev_late_time check results in only a single late event being emitted for all late tasks. Taking the simplest possible clock-triggered suite as an example:

[scheduling]
  initial cycle point = 2015
  [[special tasks]]
      clock-trigger = foo(PT0H)
  [[dependencies]]
      [[[P1Y]]]
          graph = "foo[-P1Y] => foo"
[runtime]
   [[foo]]
      script = sleep 10
      [[[events]]]
          late handler = "echo !!!LATE"

What I'd expect is, tasks foo.2015 through foo.2018 should each emit a late event shortly after they each come into existence, then foo.2019 onward will submit on time with no further late events emitted.
This seems the most sensible way of late-alerting to me (even though foo.2016, say, has technically been late ever since Jan 2016, its lateness could not be alerted before now because the suite was not even running until now ... or because the suite was running so far behind that it had not become aware of .2016 tasks until now).

But on this branch, only foo.2015 emits a late event.

matthewrmshin · 2018-03-16T10:23:25Z

That's not the intention, but you are right about the prev_late_time flag. It is because foo.2016..8 are not yet spawned when we first run through the main loop, so the global prev_late_time has moved from epoch to a much more current time. My intention is to avoid adding another table/column to the runtime database (which can cause compatibility issues), but perhaps this is unavoidable.
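To illustrate why only foo.2015 reports, here is a rough reconstruction of a global "previous check time" approach, based only on the description in this thread. The function name, the one-new-task-per-pass spawning and the numeric trigger times are all assumptions, not the actual scheduler code.

```python
def late_events_with_global_watermark(spawn_order, now):
    """Simulate main loop passes that share one global watermark.

    spawn_order: list of (identity, clock_trigger_time) tuples; one
    task is spawned into the pool on each pass (an assumption).
    """
    prev_late_time = 0.0  # starts at the epoch
    pool = []
    emitted = []
    for new_task in spawn_order:
        pool.append(new_task)
        for identity, trigger_time in pool:
            # Only trigger times inside (prev_late_time, now] are reported.
            if prev_late_time < trigger_time <= now:
                emitted.append(identity)
        # The watermark jumps to "now" after the first pass, so tasks
        # spawned later with older trigger times never match again.
        prev_late_time = now
    return emitted


spawned = [("foo.2015", 100.0), ("foo.2016", 200.0), ("foo.2017", 300.0)]
result = late_events_with_global_watermark(spawned, now=1000.0)
```

In this sketch `result` contains only `foo.2015`: by the time foo.2016 exists, the shared watermark has already moved past its trigger time.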

matthewrmshin · 2018-03-16T15:22:06Z

Latest implementation should do it!

hjoliver · 2018-03-16T21:33:33Z

That's got it. Thanks.

hjoliver · 2018-03-20T05:22:55Z

@matthewrmshin - I've made an attempt at documenting this feature, which I think is not going to be easy to explain to users... I'll post a PR to your branch later tonight.

sadielbartholomew · 2018-03-20T12:46:07Z

I have one general comment (I'm starting to look at the code & will submit a review this afternoon). This is currently being implemented in a 'one-size-fits-all' way, but given that different users (ops vs. research vs. non-meteorological etc.) will naturally treat 'late' events with different levels of urgency, would it be worth the effort to include a small level of customisability here?

I feel it would be useful, & easy enough, to make the 'threshold' time past the clock trigger at which the event is handled (currently set to 1 min, a sensible default value I think) changeable via a late handler 'option' or similar to some preferred time (zero included). It would also be useful to allow overriding the logging level to 'INFO', since for some users always logging as a 'WARNING' could get irritating &/or clog up their log viewer, so that contextually more important warnings become hard to spot or are missed.

If you think this generally a good idea, we could also consider implementing the means to disable the characteristic discussed here & in the original issue whereby tasks with clock triggers subsequent to suite start-up (or restart) time emit late events. It does seem to me that last year when it was originally discussed everyone could at least see the merits of 'historical' tasks not emitting, though we now want them to; if there could be a significant divide in preference for this why not effectively provide a choice?

sadielbartholomew

Minor comments, but in general this works as discussed & the tests ensure so in a valid way. I really like how you implemented the solution to the issue at hand by the way, especially the changes to the functions in task_proxy.py! Please read my comment above, but it is just a suggestion so merge as is if you like (approving).

sadielbartholomew · 2018-03-20T13:01:57Z

lib/cylc/scheduler.py

+                    itask.state.status in TASK_STATUSES_NEVER_ACTIVE and
+                    this_late_time > itask.clock_trigger_time):
+                itask.is_late = True
+                msg = '%s (clock-trigger-time=%s)' % (


A space at the start of the msg string to output WARNING - [half_past.20180320T0700Z] - late (clock-trigger-time= ... instead of WARNING - [half_past.20180320T0700Z] -late (clock-trigger-time= ... etc is clearer (or at least nicer) formatting I think.

That's true, but we'll need to change many other statements as well, for consistency, so I think it should be done separately. Alternatively, we should probably remove the hyphen before the message. It does not serve any purpose, but this may break things that rely on a specific format of the log.

Fair enough!

sadielbartholomew · 2018-03-20T15:52:11Z

lib/cylc/task_state.py

+TASK_STATUSES_NEVER_ACTIVE = set([
+    TASK_STATUS_RUNAHEAD,
+    TASK_STATUS_WAITING,
+    TASK_STATUS_HELD,


Why has TASK_STATUS_HELD been included here? Otherwise, could we not formulate this set in terms of the existing (intuitively related) sets, with the equivalent TASK_STATUSES_ALL - TASK_STATUSES_WITH_JOB_SCRIPT - TASK_STATUSES_FINAL? If only to avoid adding more explicit sets to an already considerable list, though it depends on whether, and if so how often, these sets get changed.

The held state is almost always imposed manually, so their lateness should not be reported until they are released.

That's a good point that is not entirely obvious - can you add a comment for future reference?

But surely then the TASK_STATUS_HELD entry should be omitted from this set, as the late_tasks_check() function tests whether a task has a state belonging to the set (scheduler.py current line 1214) as part of the conditional for establishing 'lateness'? I assume from the UG that releasing a 'held' task will always revert the state to some non-active one i.e. one already in this set, so to not report until release would require its exclusion. Not sure what I have missed here!

sadielbartholomew · 2018-03-20T16:22:18Z

lib/cylc/suite_db_mgr.py

-            {"key": "final_point", "value": str(final_point)},
+            {"key": "run_mode", "value": schd.run_mode},
+            {"key": "cylc_version", "value": CYLC_VERSION},
+            {"key": "UTC_mode", "value": cylc.flags.utc},


For purposes of my own learning (this code is from a recent PR of my own): why is it better to store e.g. the UTC mode in the database as a zero/one equivalent instead of a Boolean value? In this case UTC_mode is natively (i.e. as defined in the suite.rc file) of Boolean type, which is why I thought it best to format it as such using str().

SQLite does not have a native boolean type. In such cases, it is normally best to represent boolean data using the integers 0|1, which can be used in boolean context without any conversion. (The expression bool('False') evaluates to True in Python, whereas bool(0) evaluates to False.)
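The point can be checked with a short sketch using Python's standard `sqlite3` module. The table name `params` and the key names are made up for the example and are not the suite database schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical key/value table; the value column has no declared type.
conn.execute("CREATE TABLE params (key TEXT, value)")

utc_mode = False
conn.execute(
    "INSERT INTO params VALUES (?, ?)", ("utc_as_str", str(utc_mode)))
conn.execute(
    "INSERT INTO params VALUES (?, ?)", ("utc_as_int", int(utc_mode)))

stored = dict(conn.execute("SELECT key, value FROM params"))
conn.close()

# The string form comes back as the non-empty text 'False', which is
# truthy; the integer form comes back as 0, which is falsy.
str_is_truthy = bool(stored["utc_as_str"])
int_is_truthy = bool(stored["utc_as_int"])
```

Reading the value back as an integer avoids having to parse the text 'True' or 'False' before using it in a condition.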

Thanks for explaining :)

sadielbartholomew · 2018-03-20T16:36:46Z

lib/cylc/task_proxy.py

+        .clock_trigger_time (float):
+            Clock trigger time in seconds since epoch.
+        .has_spawned (boolean):
+            Has this task spawned its successor in the sequence.


Consider changing . --> ? for consistency.

hjoliver

(Revoking my approval pending outcome of forthcoming discussions!)

hjoliver · 2018-03-20T19:02:24Z

@sadielbartholomew - I'll just weigh in on your suggestions above.

Personally I'm happy with "warning" severity for late events - it's not a normal event, but it's not critical either. Configurability that's not really needed just adds unnecessary complexity.
Making the threshold configurable is probably a good idea because clock-triggers are typically set at the earliest time that a task should run, and 1 minute after that is probably not a big deal. However, I'm about to suggest a more fundamental change below that would make this unnecessary.
Disabling late events for tasks whose cycle points pre-date the suite restart time might be sensible, and maybe we should do it (I'd need to think more on this...) but in any case it's a minor issue because late alerting is really only needed in operational contexts, and operational suites never(?) get more than a whole cycle behind. And if running the same suite in research mode, you can just turn the alerts off.

hjoliver · 2018-03-20T19:21:34Z

problem!

@cylc/core - if you read the additions to cug.tex in this PR (which came to @matthewrmshin from me) you'll see that, while trying to explain how to use this feature, I more or less convinced myself that it is useless and ended up recommending use of an external monitoring system to determine lateness (which is what we already do at NIWA). That's because:

Cylc can only absolutely identify lateness in clock-triggered tasks,
but clock-triggered tasks are typically not the ones that need late alerting
and even if they were the ones, they would typically not generate late events during typical operational delays of less than one cycle, because they typically have no upstream dependencies and can therefore always run on time (so long as the task proxy exists already).
and finally (as explained in cug.tex) it is a bad idea to put artificial clock-triggers on the "important" tasks, or to put artificial dependencies on the natural clock-triggered tasks.

... therefore, we cannot get useful late alerting out of this new feature as implemented. (apologies @matthewrmshin for failing to realise this before you did the implementation! - however, most of your code will work for what follows...)

hjoliver · 2018-03-20T19:48:05Z

solution?

What operators really want (I believe) is to be able to say, "in our environment task X normally triggers at time T, and if it has not done so by T+n alert us that something might be wrong". This is how they use external systems like Nagios to monitor suite progress.

The problem for Cylc is, the suite itself does not entirely dictate the value of time T (except in the case of clock-triggered tasks - but see above). In principle we could try to predict T by traversing the graph since the last clock-triggered task and adding up execution time limits (if defined) - but that would miss any external constraints on suite progress (waiting in PBS queues...).

Ultimately we could get Cylc to remember the actual wall-clock trigger time of all tasks (relative to their cycle points) and do late alerting based on deviations from the mean - under the assumption that operational systems are reasonably consistent in this respect (which should be true for the big forecast tasks at least).

For now though, I propose we simply convert this PR to emit late events based on a new task attribute late offset (or similar) that is expressed exactly like a clock trigger (as some offset from task cycle point) but has nothing to do with triggering. This would be exactly as effective as what people are doing already, but has the advantage of not requiring any external system like Nagios.

Thoughts?

matthewrmshin · 2018-03-20T20:35:58Z

Should be easy enough to implement.

sadielbartholomew · 2018-03-21T10:58:49Z

Thanks for your response to my questions, @hjoliver, & sorry for not picking up on the fact that you had reservations about the usefulness of the late event in the cug.tex commit, which was merged into this PR shortly before I commented. I skim read it & took it as advice that these late events should be used sparingly.

Absolutely agree that 'configurability that's not really needed just adds unnecessary complexity' & these new points regarding usefulness of the current idea are very significant so glad you brought them up. I have little knowledge of practical operational system management but your solution seems very sensible.

sadielbartholomew

New considerations to address

hjoliver · 2018-03-21T11:14:39Z

> sorry for not picking up on the fact that you had reservations...

That's alright - I didn't really pick up on the implications of what I was saying myself, until the next morning, after I'd slept on it!

matthewrmshin · 2018-03-23T16:18:50Z

Lateness is no longer related to clock trigger. A late offset setting is used to determine if a task is late compared to its cycle point or not. (Or should it be called waiting timeout, like submission timeout and execution timeout?) ⌚ ⌛
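As a rough sketch of the rule stated above (the helper name `is_late` and its signature are assumptions for illustration, not the code in this PR): a task with a late offset configured counts as late once the current time passes its cycle point plus that offset.

```python
from datetime import datetime, timedelta, timezone


def is_late(cycle_point, late_offset, now):
    """Return True if `now` is beyond cycle point + late offset.

    `late_offset` is None when no late offset is configured, in which
    case the task is never reported as late.
    """
    if late_offset is None:
        return False
    return now > cycle_point + late_offset


cycle_point = datetime(2018, 3, 20, 7, 0, tzinfo=timezone.utc)
offset = timedelta(minutes=30)

on_time = is_late(cycle_point, offset, cycle_point + timedelta(minutes=29))
late = is_late(cycle_point, offset, cycle_point + timedelta(minutes=31))
unconfigured = is_late(cycle_point, None, cycle_point + timedelta(days=1))
```

Unlike a clock trigger, the offset here plays no part in deciding when the task may run; it only decides when to raise the event.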

matthewrmshin · 2018-03-23T16:27:28Z

I'll add something to the CUG when we can finally agree with the approach.

hjoliver · 2018-04-02T22:44:17Z

lib/cylc/scheduler.py

@@ -1227,6 +1207,22 @@ def database_health_check(self):
            # Something has to be very wrong here, so stop the suite
            raise SchedulerError(str(exc))

+    def late_tasks_check(self):
+        """Report tasks that are late for their clock triggers."""


Docstring is now wrong (lateness is no longer linked to clock triggers).

This still needs to be updated in line with @hjoliver's previous comment, I believe.

hjoliver · 2018-04-02T23:16:04Z

> I'll add something to the CUG when we can finally agree with the approach.

I agree with this approach (not surprisingly perhaps!), so we're good to go if you and your team agree.

We should keep much of the cug documentation already added, to explain why lateness can't easily be determined automatically. ... I'll have a go at quickly reformulating the docs now and put up a PR to your branch.

Also I'm happy with the term "late offset" because unlike the timeouts it is an offset from cycle point rather than an interval after some suite event.

sadielbartholomew · 2018-04-05T11:08:08Z

I also agree with the approach & with 'late offset' as the descriptor (seems intuitive). Let me know when this is ready for re-review.

matthewrmshin · 2018-04-24T11:23:20Z

Branch re-based. Conflicts resolved.

hjoliver

Seems to be good to go now.

sadielbartholomew

(~~Starting re-review~~ In review, may be a while to complete due to other commitments.)

sadielbartholomew

Looks great: works logically & functionally as per @hjoliver's specification & local tests pass. I do have some minor comments but nothing regarding code functionality. Merge at will once addressed or otherwise.

sadielbartholomew · 2018-04-26T12:58:33Z

lib/cylc/scheduler.py

@@ -1227,6 +1207,22 @@ def database_health_check(self):
            # Something has to be very wrong here, so stop the suite
            raise SchedulerError(str(exc))

+    def late_tasks_check(self):
+        """Report tasks that are late for their clock triggers."""


This still needs to be updated in line with @hjoliver's previous comment, I believe.

sadielbartholomew · 2018-04-26T13:21:22Z

lib/cylc/scheduler.py

+        now = time()
+        for itask in self.pool.get_tasks():
+            # External trigger matching and task expiry must be done
+            # regardless, so they need to be in separate "if ..." blocks.


Good comment! Nicely anticipated since I was just about to ask about this.

sadielbartholomew · 2018-04-26T13:33:00Z

doc/src/cylc-user-guide/suiterc.tex

@@ -1680,6 +1680,7 @@ \subsection{[runtime]}
    \item {\bf warning}        - the task reported a WARNING severity message
    \item {\bf critical}       - the task reported a CRITICAL severity message
    \item {\bf custom}       - the task reported a CUSTOM severity message
+    \item {\bf late}       - the task has been delayed relative to its clock-trigger time


Shouldn't this description also be updated to reflect the fact that lateness is now unrelated to clock triggers?

sadielbartholomew · 2018-04-26T13:33:10Z

doc/src/cylc-user-guide/cug.tex

@@ -6979,6 +6979,7 @@ \subsection{Task Event Handling}
    \item `warning' - the task reported a WARNING severity message
    \item `critical' - the task reported a CRITICAL severity message
    \item `custom' - the task reported a CUSTOM severity message
+    \item `late' - the task has been delayed relative to its clock-trigger time


Shouldn't this description also be updated to reflect the fact that lateness is now unrelated to clock triggers?

sadielbartholomew · 2018-04-26T14:13:48Z

lib/cylc/task_proxy.py

+    __slots__ = [
+        'cleanup_cutoff',
+        'clock_trigger_time',
+        'expire_time',


'expire_time' is no longer an attribute of this class - apparently it wasn't before this PR, but since the attributes are now listed with descriptions we should clear this up. Line 238 can simply be removed.

I think it was missed in the description. The attribute is still used.

Attribute now added to doc string.

sadielbartholomew · 2018-04-26T14:27:10Z

lib/cylc/scheduler.py

            self.task_events_mgr.pflag = True
-        elif key == 'warm_point':
-            self._cli_start_point_string = value
+        elif key in ["start_point", "warm_point"]:


"warm_point" doesn't seem to be used anywhere now as far as I can tell, so we can simply test for equality with "start_point".

For back compat. Comment added.

Ah, of course! I need to improve at considering backward compatibility as a possible reason for seeing code which does not have an obvious rationale. Though good idea for a comment.

matthewrmshin · 2018-04-26T19:39:22Z

All comments addressed.

matthewrmshin added this to the soon milestone Mar 6, 2018

matthewrmshin self-assigned this Mar 6, 2018

matthewrmshin force-pushed the late-clock-trigger branch 2 times, most recently from 9c338a6 to 1bbf8ab Compare March 7, 2018 10:39

matthewrmshin force-pushed the late-clock-trigger branch from 1bbf8ab to 7365210 Compare March 15, 2018 12:36

matthewrmshin modified the milestones: soon, next release Mar 15, 2018

matthewrmshin force-pushed the late-clock-trigger branch from 7365210 to c74a767 Compare March 15, 2018 13:02

matthewrmshin requested review from hjoliver and sadielbartholomew March 15, 2018 14:08

matthewrmshin force-pushed the late-clock-trigger branch from c74a767 to 1a04cb8 Compare March 16, 2018 15:19

hjoliver approved these changes Mar 20, 2018

sadielbartholomew approved these changes Mar 20, 2018

hjoliver requested changes Mar 20, 2018

sadielbartholomew requested changes Mar 21, 2018

matthewrmshin force-pushed the late-clock-trigger branch from a8224a9 to ac27593 Compare March 23, 2018 16:11

hjoliver reviewed Apr 2, 2018

hjoliver mentioned this pull request Apr 3, 2018

Automatic offset computation for late events? #2615

Open

matthewrmshin force-pushed the late-clock-trigger branch from 6d139f3 to 86bbca7 Compare April 17, 2018 15:00

hjoliver and others added 3 commits April 24, 2018 10:52

Document late events.

2e054d1

Event for tasks that are late w.r.t. cycle point

8399b30

Add event settings to a task to report itself as "late" with respect to its (date-time) cycle point.

Updated late event documentation.

fb800e8

matthewrmshin force-pushed the late-clock-trigger branch from 86bbca7 to fb800e8 Compare April 24, 2018 11:22

hjoliver approved these changes Apr 26, 2018

sadielbartholomew reviewed Apr 26, 2018

sadielbartholomew approved these changes Apr 26, 2018

Improve comments and docs

f32e75f

hjoliver merged commit d6a3535 into cylc:master Apr 26, 2018

matthewrmshin deleted the late-clock-trigger branch April 26, 2018 22:06

hjoliver changed the title ~~Handle late clock trigger task event~~ Handle task "late" events May 11, 2018
