
suite stalled warning or notification #1286

Closed
kaday opened this issue Jan 16, 2015 · 13 comments

@kaday
Contributor

kaday commented Jan 16, 2015

A suite is using a runahead limit because the suite cycles have no dependencies on previous cycles, but need to be limited due to large data volumes.
It would be helpful if, when the suite reaches its runahead limit because an earlier task has failed, a warning message were recorded in stderr (with mail sent as an option) to help the user identify why the suite is waiting.
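For context, a suite like the one described limits runahead in the `[scheduling]` section of its suite.rc. A minimal sketch, using Cylc 6 era syntax; the interval value and task names are illustrative:

```ini
[scheduling]
    initial cycle point = 20150101T00
    # Hold back cycles more than 12 hours ahead of the slowest one,
    # to cap data volumes and concurrent cycle points.
    runahead limit = PT12H
    [[dependencies]]
        [[[PT6H]]]
            # No inter-cycle dependencies: only the runahead limit
            # stops all cycles running at once.
            graph = "fetch => process => archive"
```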

@kaday kaday added this to the later milestone Jan 16, 2015
@hjoliver
Member

@kaday - I'm not sure this is a good idea, although you can try to convince me! Any suite will typically be runahead limited some or all of the time, unless it is being constrained by clock-triggers in real time caught-up operation, and that isn't a condition that warrants a warning - it's merely a mechanism to stop the suite getting unnecessarily spread out, or potentially to prevent too many cycle points running at once for compute-loading reasons.

Furthermore, I'm not sure I understand the example you've given. Failed tasks have no effect on the runahead limit, so presumably you mean that the suite has stalled because a task that depends on the failed one is unable to run? This does indeed hold the runahead limit down, but the suite operator should have got an error message about the failed task, which is the root cause of the problem - is this not sufficient in your opinion?

@matthewrmshin
Contributor

Is it possible to trigger a special event when the suite can no longer proceed any further?

@arjclark
Contributor

@matthewrmshin - I guess that would need to be "can no longer proceed any further due to a failed task"? A suite can get to the point where it can't proceed any further due to a running task, which is regular day-to-day behaviour.

@arjclark
Contributor

E.g. for a simple `graph = foo` suite, where everything is running fine, you aren't in error if you hit the runahead limit because the earliest foo task is still running.

@matthewrmshin
Contributor

> @matthewrmshin - I guess that would need to be "can no longer proceed any further due to a failed task"? A suite can get to the point where it can't proceed any further due to a running task, which is regular day-to-day behaviour.

Yes. It is when the suite requires human input - it is stuck and is no longer anticipating any automatic future events.

@hjoliver
Member

> Yes. It is when the suite requires human input - it is stuck and is no longer anticipating any automatic future events.

That may be very hard to determine in general. E.g. a failure recovery task could be waiting on a clock-trigger before it cleans up the failed task and its downstream dependants. So "no longer anticipating any automatic future events" has to include clock-triggers ... but this hourly-cycling suite could have a couple of monthly-cycling tasks that are already waiting on a distant clock-trigger. Worst-case scenario, of course: some external suite or system might be monitoring the suite and be primed to manipulate it.

It seems to me that task failed event handlers are sufficient. That is far simpler, and it immediately identifies the root cause of the problem. Do we really need an additional alert that the suite has stalled because of the failure? If a task has failed there will likely be consequences, and you should check to see what they are.

By the way, timeout event hooks already enable detection of stalled suites, although admittedly not as quickly as this proposal (if it were feasible) would.

I can see a case for warning of suites that are stalled due to missing task proxies, e.g. after some bad manual intervention, or after a restart from a corrupted state file (but still, can we detect this automatically, in general?)
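The timeout hook mentioned above is configured under `[cylc][[events]]` in the suite.rc. A hedged sketch, assuming Cylc 6/7 era syntax; the handler script name is illustrative, and event handlers are passed details of the event as command-line arguments:

```ini
[cylc]
    [[events]]
        # Run a handler if the suite makes no progress for 2 hours,
        # catching stalls (for any reason) after the fact.
        timeout = PT2H
        timeout handler = notify-operator.sh
```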

@hjoliver
Member

After a bit of thinking, maybe this isn't so difficult (although in the failed task case, I still think it's unnecessary).

My clock-triggered failure-recovery example above is pretty far-fetched, and I guess anyone with an exotic external monitoring and intervention system could choose not to use the "suite stalled" hook.
(Nevertheless, we'd have to warn that it's possible for a suite stalled warning to be wrong ... and are there any other misdiagnosis scenarios?).

How about this:

If no tasks are submitted or running, examine all 'waiting' tasks:

- A waiting task is OK if each of its unsatisfied prerequisites is on another waiting task proxy that either exists already, or has an earlier instance that exists and will spawn into the right cycle point.
- But the suite is stalled (given that nothing is submitted or running at this time) if any task depends on the success of a task that has failed (or on the failure of a task that has succeeded, etc.), or has an unsatisfied prerequisite on a task that does not exist and has no predecessor that exists and will spawn into the right cycle.

We don't need to consider clock triggers - if they've triggered already, they're not the reason for the task waiting; if they've not triggered yet, the important question is can the task's other prerequisites (if any) be satisfied or not as above.
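The check sketched above can be illustrated in Python. This is a hypothetical simplification, not Cylc's actual task-pool code: task proxies are plain dicts, each prerequisite is a `(task_name, required_state)` pair, and the "earlier instance that will spawn into the right cycle" case is reduced to a comment.

```python
# States in which the suite is, by definition, not stalled.
ACTIVE = {"submitted", "running"}


def suite_is_stalled(tasks):
    """Return True if the suite can make no further automatic progress.

    tasks: list of dicts with keys 'name', 'state', and 'prereqs',
    where 'prereqs' is a list of (task_name, required_state) pairs.
    """
    # A suite cannot be stalled while anything is still active.
    if any(t["state"] in ACTIVE for t in tasks):
        return False
    states = {t["name"]: t["state"] for t in tasks}
    for task in tasks:
        if task["state"] != "waiting":
            continue
        for name, required in task["prereqs"]:
            actual = states.get(name)
            if actual is None:
                # Upstream proxy missing entirely; the real check would
                # also look for an earlier instance that will spawn into
                # this cycle point before declaring a stall.
                return True
            if actual not in ("waiting", required):
                # Upstream is in a final state incompatible with the
                # prerequisite, e.g. depends on success of a failed task.
                return True
    return False
```

A failed upstream task with a waiting dependant reports a stall, while the same graph with the upstream task still running does not.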

@hjoliver
Member

We should probably change the title of this issue to "suite stalled warning".

@dpmatthews
Contributor

This feature would be very useful for rose stem suites where the users would like to be able to configure the suite to stop when it has stalled (at the moment they either have to let it time out or stop after the first failure).

@arjclark
Contributor

@dpmatthews - not sure how useful this is for rose stem suites. Since they don't cycle and only run for (relatively) short periods of time, email notification on a failure followed by a shutdown on suite timeout is more than sufficient. They also tend to be much more closely monitored than trial suites (such as the one that prompted Kerry to raise the issue), as their results are tied to code testing, so rather than running something and waiting a month before looking back, it's a case of running it and checking in a few hours later (if not before).

@dpmatthews
Contributor

@arjclark
Contributor

Fair enough; there does appear to be a desire for something along the lines of what Hilary's suggested.

@matthewrmshin matthewrmshin changed the title cylc suite: runahead limit reached warning suite stalled event Jan 21, 2015
@matthewrmshin matthewrmshin changed the title suite stalled event suite stalled warning or notification Mar 12, 2015
@arjclark arjclark assigned arjclark and unassigned kaday May 12, 2016
@arjclark
Contributor

I think I have a solution for this now. Awaiting merge of #1775 and then I'll open up a pull request.

@matthewrmshin matthewrmshin modified the milestones: soon, later May 12, 2016
@matthewrmshin matthewrmshin modified the milestones: next release, soon May 20, 2016