
suite stalled warning or notification #1286

Closed
kaday opened this issue Jan 16, 2015 · 13 comments

@kaday
Contributor

kaday commented Jan 16, 2015

A suite is using a runahead limit because the suite cycles have no dependencies on previous cycles, but need to be limited due to large data volumes.
It would be helpful if, when the suite reaches its runahead limit because an earlier task has failed, a warning message were recorded in stderr (with mail sent as an option) to help the user identify why the suite is waiting.
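For context, a suite like the one described limits runahead in the `[scheduling]` section of its suite.rc. A minimal sketch, using Cylc 6 era syntax; the interval value and task names are illustrative:

```ini
[scheduling]
    initial cycle point = 20150101T00
    # Hold back cycles more than 12 hours ahead of the slowest one,
    # to cap data volumes and concurrent cycle points.
    runahead limit = PT12H
    [[dependencies]]
        [[[PT6H]]]
            # No inter-cycle dependencies: only the runahead limit
            # stops all cycles running at once.
            graph = "fetch => process => archive"
```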

@kaday kaday added this to the later milestone Jan 16, 2015
@hjoliver
Member

@kaday - I'm not sure this is a good idea, although you can try to convince me! Any suite will typically be runahead limited some or all of the time, unless it is being constrained by clock-triggers in real time caught-up operation, and that isn't a condition that warrants a warning - it's merely a mechanism to stop the suite getting unnecessarily spread out, or potentially to prevent too many cycle points running at once for compute-loading reasons.

Furthermore, I'm not sure I understand the example you've given. Failed tasks have no effect on the runahead limit, so presumably you mean that the suite has stalled because a task that depends on the failed one is unable to run? This does indeed hold the runahead limit down, but the suite operator should have got an error message about the failed task, which is the root cause of the problem - is this not sufficient in your opinion?

@matthewrmshin
Contributor

Is it possible to trigger a special event when the suite can no longer proceed any further?

@arjclark
Contributor

@matthewrmshin - I guess that would need to be "can no longer proceed any further due to a failed task"? A suite can get to the point where it can't proceed any further due to a running task, which is regular day-to-day behaviour.

@arjclark
Contributor

E.g. for a simple `graph = foo` suite, where everything is running fine, you aren't in error if you hit the runahead limit because the earliest foo task is still running.

@matthewrmshin
Contributor

> @matthewrmshin - I guess that would need to be "can no longer proceed any further due to a failed task"? A suite can get to the point where it can't proceed any further due to a running task, which is regular day-to-day behaviour.

Yes. It is when the suite requires human input - it is stuck and is no longer anticipating any automatic future events.

@hjoliver
Member

> Yes. It is when the suite requires human input - it is stuck and is no longer anticipating any automatic future events.

That may be very hard to determine in general. E.g. a failure recovery task could be waiting on a clock-trigger before it cleans up the failed task and its downstream dependants. So "no longer anticipating any automatic future events" has to include clock-triggers ... but this hourly-cycling suite could have a couple of monthly-cycling tasks that are already waiting on a distant clock-trigger. Worst-case scenario, of course: some external suite or system might be monitoring the suite and be primed to manipulate it.

It seems to me that task failed event handlers are sufficient. That is far simpler, and it immediately identifies the root cause of the problem. Do we really need an additional alert that the suite has stalled because of the failure? If a task has failed there will likely be consequences, and you should check to see what they are.

By the way, timeout event hooks already enable detection of stalled suites, although admittedly not as quickly as this proposal (if it were feasible) would.

I can see a case for warning of suites that are stalled due to missing task proxies, e.g. after some bad manual intervention, or after a restart from a corrupted state file (but still, can we detect this automatically, in general?)
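The timeout hook mentioned above is configured under `[cylc][[events]]` in the suite.rc. A hedged sketch, assuming Cylc 6/7 era syntax; the handler script name is illustrative, and event handlers are passed details of the event as command-line arguments:

```ini
[cylc]
    [[events]]
        # Run a handler if the suite makes no progress for 2 hours,
        # catching stalls (for any reason) after the fact.
        timeout = PT2H
        timeout handler = notify-operator.sh
```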

@hjoliver
Member

After a bit of thinking, maybe this isn't so difficult (although in the failed task case, I still think it's unnecessary).

My clock-triggered failure-recovery example above is pretty far-fetched, and I guess anyone with an exotic external monitoring and intervention system could choose not to use the "suite stalled" hook.
(Nevertheless, we'd have to warn that it's possible for a suite stalled warning to be wrong ... and are there any other misdiagnosis scenarios?).

How about this:

If no tasks are submitted or running, examine all 'waiting' tasks:

- A waiting task is OK if each of its unsatisfied prerequisites is on another waiting task proxy that either exists already, or has an earlier instance that exists and will spawn into the right cycle point.
- But the suite is stalled (given that nothing is submitted or running at this time) if any task depends on the success of a task that has failed (or on the failure of a task that has succeeded, etc.), or has an unsatisfied prerequisite on a task that does not exist and has no predecessor that exists and will spawn into the right cycle.

We don't need to consider clock triggers - if they've triggered already, they're not the reason for the task waiting; if they've not triggered yet, the important question is can the task's other prerequisites (if any) be satisfied or not as above.
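The check sketched above can be illustrated in Python. This is a hypothetical simplification, not Cylc's actual task-pool code: task proxies are plain dicts, each prerequisite is a `(task_name, required_state)` pair, and the "earlier instance that will spawn into the right cycle" case is reduced to a comment.

```python
# States in which the suite is, by definition, not stalled.
ACTIVE = {"submitted", "running"}


def suite_is_stalled(tasks):
    """Return True if the suite can make no further automatic progress.

    tasks: list of dicts with keys 'name', 'state', and 'prereqs',
    where 'prereqs' is a list of (task_name, required_state) pairs.
    """
    # A suite cannot be stalled while anything is still active.
    if any(t["state"] in ACTIVE for t in tasks):
        return False
    states = {t["name"]: t["state"] for t in tasks}
    for task in tasks:
        if task["state"] != "waiting":
            continue
        for name, required in task["prereqs"]:
            actual = states.get(name)
            if actual is None:
                # Upstream proxy missing entirely; the real check would
                # also look for an earlier instance that will spawn into
                # this cycle point before declaring a stall.
                return True
            if actual not in ("waiting", required):
                # Upstream is in a final state incompatible with the
                # prerequisite, e.g. depends on success of a failed task.
                return True
    return False
```

A failed upstream task with a waiting dependant reports a stall, while the same graph with the upstream task still running does not.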

@hjoliver
Member

We should probably change the title of this issue to "suite stalled warning".

@dpmatthews
Contributor

This feature would be very useful for rose stem suites where the users would like to be able to configure the suite to stop when it has stalled (at the moment they either have to let it time out or stop after the first failure).

@arjclark
Contributor

@dpmatthews - not sure how useful this is for rose stem suites. Since they don't cycle and only run for (relatively) short periods of time, email notification on a failure followed by a shutdown on suite timeout is more than sufficient. They also tend to be much more closely monitored than trial suites (such as the one that prompted Kerry to raise the issue), as their results are tied to code testing, so rather than running something and waiting a month before looking back, it's a case of running it and checking in a few hours later (if not before).

@dpmatthews
Contributor

@arjclark
Contributor

Fair enough; there does appear to be a desire for something along the lines of what Hilary's suggested.

@matthewrmshin matthewrmshin changed the title cylc suite: runahead limit reached warning suite stalled event Jan 21, 2015
@matthewrmshin matthewrmshin changed the title suite stalled event suite stalled warning or notification Mar 12, 2015
@arjclark arjclark assigned arjclark and unassigned kaday May 12, 2016
@arjclark
Contributor

I think I have a solution for this now. Awaiting merge of #1775 and then I'll open up a pull request.

@matthewrmshin matthewrmshin modified the milestones: soon, later May 12, 2016
@matthewrmshin matthewrmshin modified the milestones: next release, soon May 20, 2016