Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle task "late" events #2597

Merged
merged 4 commits into from
Apr 26, 2018
Merged

Conversation

matthewrmshin
Copy link
Contributor

@matthewrmshin matthewrmshin commented Mar 6, 2018

Log and emit a late event when a task reaches its clock trigger time but has unmet prerequisites.

(In this implementation, it will only emit the event as the suite runs, so tasks with clock-trigger before the time of current suite start up (or restart) will not get reported. Happy to improve on this behaviour.)

In the latest implementation, the suite will log and emit a late event once for each clock trigger task that is late for more than a minute. This should work correctly on start up and restart.

In the latest implementation, the suite will log and emit a late event once for each task that has late offset configured and that the current time is beyond its cycle point time + the late offset. This should work correctly on start up and restart.

Close #2207. Supersede #2594.

@matthewrmshin matthewrmshin added this to the soon milestone Mar 6, 2018
@matthewrmshin matthewrmshin self-assigned this Mar 6, 2018
@matthewrmshin matthewrmshin force-pushed the late-clock-trigger branch 2 times, most recently from 9c338a6 to 1bbf8ab Compare March 7, 2018 10:39
@hjoliver
Copy link
Member

hjoliver commented Mar 8, 2018

(In this implementation, it will only emit the event as the suite runs, so tasks with clock-trigger before the time of current suite start up (or restart) will not get reported. Happy to improve on this behaviour.)

I can see how your use of prev_late_time_check = now is a nice clean way to do this, but IMO tasks should emit a late event even if their clock trigger time is not greater than "now" - such tasks are "even later" than those that can't run at their clock-trigger times once the suite has caught up, after all. (Does that make sense?!)

@matthewrmshin
Copy link
Contributor Author

Should be easy enough to implement if I understand this correctly.

  • Report late tasks on suite start up?
  • Remember previous late check time on restart?

@hjoliver
Copy link
Member

hjoliver commented Mar 12, 2018

No, what I meant was easier than that (if I understand you correctly!):

Every waiting task proxy should emit a late event (once) if the wall-clock time is found to be greater than its clock-trigger time, whether or not it has other unmet prerequisites.

This implies, I think, we need some threshold for lateness, since even in a caught-up suite clock-triggered tasks will likely do their final clock check fractionally late. The threshold could be configurable or based on the current main loop interval?

Future instances of the same task that may also be late already cannot emit a late event yet because their task proxies don't exist yet to check the time, but I don't think that matters - late events will come from the most recent instance of a task that could possibly run at the moment given the progress of the suite to date. And if one task instance is really late, you might expect it's successor to be late too, and so on (depending on context).

This probably implies each individual task proxy has to remember if it has emitted a late event yet (and that should persist else we'll get repeat late events after a restart).

@matthewrmshin
Copy link
Contributor Author

OK. I think I get what you mean.

@hjoliver
Copy link
Member

Maybe we still haven't understood each other on what exactly these late events should be, but this doesn't work as I expected. You global prev_late_time check results in only a single late event being emitted for all late tasks. Taking the simplest possible clock-triggered suite as an example:

[scheduling]
  initial cycle point = 2015
  [[special tasks]]
      clock-trigger = foo(PT0H)
  [[dependencies]]
      [[[P1Y]]]
          graph = "foo[-P1Y] => foo"
[runtime]
   [[foo]]
      script = sleep 10
      [[[events]]]
          late handler = "echo !!!LATE"

What I'd expect is, tasks foo.2015 through foo.2018 should each emit a late event shortly after they each come into existence, then foo.2019 onward will submit on time with no further late events emitted.
This seems the most sensible way of late-alerting to me (even though foo.2016, say, has technically been late ever since Jan 2016, its lateness could not be alerted before now because the suite was not even running until now ... or because the suite was running so far behind that it had not become aware of .2016 tasks until now).

But on this branch, only foo.2015 emits a late event.

@matthewrmshin
Copy link
Contributor Author

That's not the intention, but you are right about the prev_late_time flag. It is because foo.2016..8 are not yet spawned when we first run through the main loop, so the global prev_late_time has moved from epoch to a much more current time. My intention is to avoid adding another table/column to the runtime database (which can cause compatibility issues), but perhaps this is unavoidable.

@matthewrmshin
Copy link
Contributor Author

Latest implementation should do it!

@hjoliver
Copy link
Member

That's got it. Thanks.

@hjoliver
Copy link
Member

@matthewrmshin - I've made an attempt at documenting this feature, which I think is not going to be easy to explain to users... I'll post a PR to your branch later tonight.

@sadielbartholomew
Copy link
Collaborator

I have one general comment (I'm starting to look at the code & will submit a review this afternoon). This is currently being implemented in a 'one-size-fits-all' way, but given different users (ops vs. research vs. non-meteorological etc) will surely naturally to treat 'late' events with different levels of urgency would it be worth the effort to include a small level of customisability here?

I feel it would be useful, & easy enough, to make the 'threshold' time past the clock trigger at which the event is handled (that is currently set to 1 min, a sensible default value I think) changeable via a late handler 'option' or similar to some preferred time (to include zero), & also to override the logging level to 'INFO' as for some always logging as a 'WARNING' could get irritating &/or clog up their log viewer so that contextually more important warnings become hard to spot or missed.

If you think this generally a good idea, we could also consider implementing the means to disable the characteristic discussed here & in the original issue whereby tasks with clock triggers subsequent to suite start-up (or restart) time emit late events. It does seem to me that last year when it was originally discussed everyone could at least see the merits of 'historical' tasks not emitting, though we now want them to; if there could be a significant divide in preference for this why not effectively provide a choice?

Copy link
Collaborator

@sadielbartholomew sadielbartholomew left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comments, but in general this works as discussed & the tests ensure so in a valid way. I really like how you implemented the solution to the issue at hand by the way, especially the changes to the functions in task_proxy.py! Please read my comment above, but it is just a suggestion so merge as is if you like (approving).

itask.state.status in TASK_STATUSES_NEVER_ACTIVE and
this_late_time > itask.clock_trigger_time):
itask.is_late = True
msg = '%s (clock-trigger-time=%s)' % (
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A space at the start of the msg string to output WARNING - [half_past.20180320T0700Z] - late (clock-trigger-time= ... instead of WARNING - [half_past.20180320T0700Z] -late (clock-trigger-time= ... etc is clearer (or at least nicer) formatting I think.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's true, but we'll need to change many other statements as well, for consistency, so I think it should be done separately. Alternatively, we should probably remove the hyphen before the message. It does not serve any purpose, but this may break things that rely on a specific format of the log.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough!

TASK_STATUSES_NEVER_ACTIVE = set([
TASK_STATUS_RUNAHEAD,
TASK_STATUS_WAITING,
TASK_STATUS_HELD,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why has TASK_STATUS_HELD been included here? Otherwise, could we not formulate this set in terms of the existing (intuitively related) sets with the equivalent TASK_STATUSES_ALL - TASK_STATUSES_WITH_JOB_SCRIPT - TASK_STATUSES_FINAL? If anything to avoid adding more explicit sets to an already considerable list - though it depends on if, and if so how often, these sets gets changed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The held state is almost always imposed manually, so their lateness should not be reported until they are released.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point that is not entirely obvious - can you add a comment for future reference?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But surely then the TASK_STATUS_HELD entry should be omitted from this set, as the late_tasks_check() function tests whether a task has a state belonging to the set (scheduler.py current line 1214) as part of the conditional for establishing 'lateness'? I assume from the UG that releasing a 'held' task will always revert the state to some non-active one i.e. one already in this set, so to not report until release would require its exclusion. Not sure what I have missed here!

{"key": "final_point", "value": str(final_point)},
{"key": "run_mode", "value": schd.run_mode},
{"key": "cylc_version", "value": CYLC_VERSION},
{"key": "UTC_mode", "value": cylc.flags.utc},
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For purposes of my own learning (this code is from a recent PR of my own) why is it better to store e.g. the UTC mode to the database as a zero/one equivalent instead of a Boolean value? In this case UTC_mode is natively i.e. as defined in the suite.rc file of Boolean type, hence why I thought it best to convert to format it as such using str().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SQLite does not have a native boolean type. In such case, it is normally best to represent boolean data using the integers 0|1, which can be used in boolean context without any conversion. (The expression bool('False') evaluates to True in Python, whereas bool(0) evaluates to False.)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for explaining :)

.clock_trigger_time (float):
Clock trigger time in seconds since epoch.
.has_spawned (boolean):
Has this task spawned its successor in the sequence.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider changing . --> ? for consistency.

Copy link
Member

@hjoliver hjoliver left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Revoking my approval pending outcome of forthcoming discussions!)

@hjoliver
Copy link
Member

hjoliver commented Mar 20, 2018

@sadielbartholomew - I'll just weigh in on your suggestions above.

  • Personally I'm happy with "warning" severity for late events - it's not a normal event, but it's not critical either. Configurability that's not really needed just adds unnecessary complexity.
  • Making the threshold configurable is probably a good idea because clock-triggers are typically set at the earliest time that a task should run, and 1 minute after that is probably not a big deal. However, I'm about to suggest a more fundamental change below that would make this unnecessary.
  • Disabling late events for tasks whose cycle points pre-date the suite restart time might be sensible, and maybe we should do it (I'd need to think more on this...) but in any case it's a minor issue because late alerting is really only needed in operational contexts, and operational suites never(?) get more than a whole cycle behind. And if running the same suite in research mode, you can just turn the alerts off.

@hjoliver
Copy link
Member

hjoliver commented Mar 20, 2018

problem!

@cylc/core - if you read the additions to cug.tex this PR (which came to @matthewrmshin from me) you'll see that while trying to explain how to use this feature, I more or less convinced myself that it is useless and ended up recommending use of an external monitoring system to determine lateness (which is what we already do at NIWA). That's because:

  • Cylc can only absolutely identify lateness in clock-triggered tasks,
  • but clock-triggered tasks are typically not the ones that need late alerting
  • and even if they were the ones, they would typically not generate late events during typical operational delays of less than one cycle, because they typically have no upstream dependencies and can therefore always run on time (so long as the task proxy exists already).
  • and finally (as explained in cug.tex) it is a bad idea to put artificial clock-triggers on the "important" tasks, or to put artificial dependencies on the natural clock-triggered tasks.

... therefore, we cannot get useful late alerting out of this new feature as implemented. (apologies @matthewrmshin for failing to realise this before you did the implementation! - however, most of your code will work for what follows...)

@hjoliver
Copy link
Member

solution?

What operators really want (I believe) is to be able to say, "in our environment task X normally triggers at time T, and if it has not done so by T+n alert us that something might be wrong". This is how they use external systems like Nagios to monitor suite progress.

The problem for Cylc is, the suite itself does not entirely dictate the value of time T (except in the case of clock-triggered tasks - but see above). In principle we could try to predict T by traversing the graph since the last clock-triggered task and adding up execution time limits (if defined) - but that would miss any external constraints on suite progress (waiting in PBS queues...).

Ultimately we could get Cylc to remember the actual wall-clock trigger time of all tasks (relative to their cycle points) and do late alerting based on deviations from the mean - under the assumption that operational systems are reasonably consistent in this respect (which should be true for the big forecast tasks at least).

For now though, I propose we simply convert this PR to emit late events based on a new task attribute late offset (or similar) that is expressed exactly like a clock trigger (as some offset from task cycle point) but has nothing to do with triggering. This would be exactly as effective as what people are doing already, but has the advantage of not requiring any external system like Nagios.

Thoughts?

@matthewrmshin
Copy link
Contributor Author

Should be easy enough to implement.

@sadielbartholomew
Copy link
Collaborator

Thanks for your response to my questions, @hjoliver, & sorry for not picking up on the fact that you were had reservations about the lack of use of the late event in the cug.tex commit which was merged into this PR shortly before I commented, I skim read it & took it as advice that these late events should be used sparingly.

Absolutely agree that 'configurability that's not really needed just adds unnecessary complexity' & these new points regarding usefulness of the current idea are very significant so glad you brought them up. I have little knowledge of practical operational system management but your solution seems very sensible.

Copy link
Collaborator

@sadielbartholomew sadielbartholomew left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New considerations to address

@hjoliver
Copy link
Member

sorry for not picking up on the fact that you were had reservations...

That's alright - I didn't really pick up on the implications of what I was saying myself, until the next morning, after I'd slept on it!

@matthewrmshin
Copy link
Contributor Author

Lateness is no longer related to clock trigger. A late offset setting is used to determine if a task is late compared to its cycle point or not. (Or should it be called waiting timeout, like submission timeout and execution timeout`? ) ⌚ ⌛

@matthewrmshin
Copy link
Contributor Author

I'll add something to the CUG when we can finally agree with the approach.

@@ -1227,6 +1207,22 @@ def database_health_check(self):
# Something has to be very wrong here, so stop the suite
raise SchedulerError(str(exc))

def late_tasks_check(self):
"""Report tasks that are late for their clock triggers."""
Copy link
Member

@hjoliver hjoliver Apr 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doc-string is now wrong (latest no longer linked to clock triggers).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This still needs to be updated in line with @hjoliver's previous comment, I believe.

@hjoliver
Copy link
Member

hjoliver commented Apr 2, 2018

I'll add something to the CUG when we can finally agree with the approach.

I agree with this approach (not surprisingly perhaps!), so we're good to go if you and your team agree.

We should keep much of the cug documentation already added, to explain why lateness can't easily be determined automatically. ... I'll have a go at quickly reformulating the docs now and put up a PR to your branch.

Also I'm happy with the term "late offset" because unlike the timeouts it is an offset from cycle point rather than an interval after some suite event.

@sadielbartholomew
Copy link
Collaborator

I also agree with the approach & with 'late offset' as the descriptor (seems intuitive). Let me know when this is ready for re-review.

hjoliver and others added 3 commits April 24, 2018 10:52
Add event settings to a task to report itself as *late" with respect to
its (date-time) cycle point.
@matthewrmshin
Copy link
Contributor Author

Branch re-based. Conflicts resolved.

Copy link
Member

@hjoliver hjoliver left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems to be good to go now.

Copy link
Collaborator

@sadielbartholomew sadielbartholomew left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Starting re-review In review, may be a while to complete due to other commitments.)

Copy link
Collaborator

@sadielbartholomew sadielbartholomew left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great: works logically & functionally as per @hjoliver's specification & local test pass. I do have some minor comments but nothing regarding code functionality. Merge at will once addressed or otherwise.

@@ -1227,6 +1207,22 @@ def database_health_check(self):
# Something has to be very wrong here, so stop the suite
raise SchedulerError(str(exc))

def late_tasks_check(self):
"""Report tasks that are late for their clock triggers."""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This still needs to be updated in line with @hjoliver's previous comment, I believe.

now = time()
for itask in self.pool.get_tasks():
# External trigger matching and task expiry must be done
# regardless, so they need to be in separate "if ..." blocks.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good comment! Nicely anticipated since I was just about to ask about this.

@@ -1680,6 +1680,7 @@ \subsection{[runtime]}
\item {\bf warning} - the task reported a WARNING severity message
\item {\bf critical} - the task reported a CRITICAL severity message
\item {\bf custom} - the task reported a CUSTOM severity message
\item {\bf late} - the task has been delayed relative to its clock-trigger time
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this description also be updated to reflect the fact that lateness is now unrelated to clock triggers?

@@ -6979,6 +6979,7 @@ \subsection{Task Event Handling}
\item `warning' - the task reported a WARNING severity message
\item `critical' - the task reported a CRITICAL severity message
\item `custom' - the task reported a CUSTOM severity message
\item `late' - the task has been delayed relative to its clock-trigger time
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this description also be updated to reflect the fact that lateness is now unrelated to clock triggers?

__slots__ = [
'cleanup_cutoff',
'clock_trigger_time',
'expire_time',
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'expire_time' is no longer an attribute of this class - apparently it wasn't before this PR, but since the attributes are now listed with descriptions we should clear this up. Line 238 can simply be removed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it was missed in the description. The attribute is still used.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Attribute now added to doc string.

self.task_events_mgr.pflag = True
elif key == 'warm_point':
self._cli_start_point_string = value
elif key in ["start_point", "warm_point"]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"warm_point" doesn't seem to be used anywhere now as far as I can tell, so we can simply test for equality with "start_point".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For back compat. Comment added.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, of course! I need to improve at considering backward compatibility as a possible reason for seeing code which does not have an obvious rationale. Though good idea for a comment.

@matthewrmshin
Copy link
Contributor Author

All comments addressed.

@hjoliver hjoliver merged commit d6a3535 into cylc:master Apr 26, 2018
@matthewrmshin matthewrmshin deleted the late-clock-trigger branch April 26, 2018 22:06
@hjoliver hjoliver changed the title Handle late clock trigger task event Handle task "late" events May 11, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants