
Improve task polling and timeout handling #2593

Merged: 5 commits, May 1, 2018
Conversation

matthewrmshin (Contributor)

One variable to hold the timeout and one variable to hold a poll timer. Populate the timeout time and the poll timer only when the task is in an appropriate state. Reset the timeout variable after use. Combine the execution polling schedule with the execution time limit polling schedule.

Not quite a full rewrite, but should be sufficient to fix #2568.
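The schedule combination described above can be sketched roughly as follows. This is a hedged illustration only: `combine_delays` and its trimming step are my own simplification, not cylc's actual implementation.

```python
# Sketch: regular execution polls run up to the execution time limit,
# then the first time-limit poll is stretched so it lands exactly on
# the limit, followed by the remaining time-limit polling intervals.

def combine_delays(exec_delays, time_limit, time_limit_delays):
    """Return one combined list of poll delays (seconds)."""
    delays = list(exec_delays)
    # Drop regular polls that would fall past the time limit.
    while delays and sum(delays) > time_limit:
        delays.pop()
    time_limit_delays = list(time_limit_delays)
    # Shift the first time-limit poll so it fires at the time limit.
    time_limit_delays[0] += time_limit - sum(delays)
    return delays + time_limit_delays

print(combine_delays([60, 60], 150, [30, 30]))  # [60, 60, 60, 30]
```

With a 150 s limit and two 60 s execution polls, the first time-limit delay is stretched from 30 s to 60 s so the poll fires at the limit itself.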

@matthewrmshin matthewrmshin added the bug Something is wrong :( label Mar 2, 2018
@matthewrmshin matthewrmshin added this to the soon milestone Mar 2, 2018
@matthewrmshin matthewrmshin self-assigned this Mar 2, 2018
@matthewrmshin (Contributor Author)

matthewrmshin commented Mar 2, 2018

In the long run, I would quite like to separate the concepts of tasks and jobs (e.g. #2241 (comment)). With that, polling would become a job management function. I'd quite like to have a way to manage polling schedules using a centralised poller. A job would propose its polling schedule to the poller, but the poller would decide the best times - so it could bunch together related polls (e.g. task jobs on the same cluster) in its schedule.

@matthewrmshin (Contributor Author)

(Codacy failure will be fixed by #2567.)

@matthewrmshin (Contributor Author)

(Travis CI has passed but is reporting in progress here for some reason.)

hjoliver (Member) commented Mar 7, 2018

I've just received a bug report for a production system - the problem probably resulted from all post-timeout polls being done immediately, as described in #2568, instead of waiting on the configured delay intervals. It would be good to have a test to confirm that polling occurs after the configured intervals.
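The expected behaviour under discussion - each poll waits for its configured interval rather than firing immediately - can be sketched with an illustrative helper (not cylc code):

```python
# Each poll should be scheduled at the previous poll time plus its
# configured interval, never immediately after the timeout.

def poll_times(start, delays):
    """Return absolute poll times given a start time and delay intervals."""
    times = []
    now = start
    for delay in delays:
        now += delay
        times.append(now)
    return times

print(poll_times(0, [30, 60, 60]))  # [30, 90, 150]
```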

@matthewrmshin (Contributor Author)

@hjoliver Yes, I am working on turning a manual test into an automated one. Not quite there yet.

hjoliver (Member) commented Mar 7, 2018

Great, thanks!

hjoliver (Member) commented Mar 9, 2018

Activity of one task foo.1 - on this branch - killed externally, with execution time limit = PT30S and global.rc execution time limit polling intervals = PT30S,...:

2018-03-09T17:08:02+13 INFO - [foo.1] -submit-num=1, owner@host=localhost
2018-03-09T17:08:03+13 INFO - [foo.1] -(current:ready) submitted at 2018-03-09T17:08:02+13
2018-03-09T17:08:03+13 INFO - [foo.1] -job[01] submitted to localhost:background[18298]
2018-03-09T17:08:03+13 INFO - [foo.1] -(current:submitted)> started at 2018-03-09T17:08:02+13
2018-03-09T17:08:03+13 INFO - [foo.1] -next job poll in PT1M (after 2018-03-09T17:09:03+13)
2018-03-09T17:09:03+13 INFO - [foo.1] -next job poll in PT40S (after 2018-03-09T17:09:43+13)
2018-03-09T17:09:04+13 INFO - [foo.1] -(current:running) failed (polled)
2018-03-09T17:09:04+13 CRITICAL - [foo.1] -job(01) failed

This looks pretty good, but I wonder if we could have:

  • at job start, say "first job poll in PT1M ..." (not "next job poll")
  • say "polling job now at ..." when issuing the poll, to make it clearer that the poll result is not associated with the "next job poll ..." message printed just prior to it.

hjoliver (Member) commented Mar 9, 2018

We should also mention execution time limit polling intervals under execution time limit in the CUG global.rc reference.

@matthewrmshin (Contributor Author)

Modified the logging logic to:

  • On setup, print the submission/execution timeout and the full set of polling intervals.
  • On ready to poll, print "poll now, (next in ...)".

Example:

2018-03-09T15:11:09Z INFO - [t1.1] -health check settings: submission timeout=P1D, polling intervals=PT12S,...
2018-03-09T15:11:21Z INFO - [t1.1] -poll now, (next in PT12S (after 2018-03-09T15:11:33Z))

@matthewrmshin (Contributor Author)

(Rebased + conflict resolved. New test moved from tests/job-poll/03-* to tests/cylc-poll/16-*.)

@hjoliver (Member)

Damn, I forgot about this bug! To my mind, this one is worse than the other bugs fixed for the just-released 7.6.1. @matthewrmshin and @oliver-sanders - we should get this reviewed and merged quickly, if possible, and bang out 7.6.2.

@matthewrmshin (Contributor Author)

#2606 should make this one less lethal, but yes, I agree this fix should go in as soon as possible.

key1, submit_num = ctx_key
key = (key1, cycle, name, submit_num)
self.task_events_mgr.event_timers[key] = TaskActionTimer(
if ctx_key == "poll_timer":
Member:

ctx_key = "?"
    if ctx_key == "poll_timer":

Contributor Author:

Should now be fixed.

@sadielbartholomew (Collaborator)

(Adding myself as reviewer upon @matthewrmshin's request in person).

sadielbartholomew (Collaborator) commented Apr 16, 2018

Reviewing in progress. On hold until #2582 is reviewed as that is higher priority.

Still considering the core logical intention & code changes, but in the meantime I noticed that some of the tests could, in my opinion, be rewritten in a more explicit & adaptable way - see the side-PR referenced.

@matthewrmshin (Contributor Author)

Branch rebased. Conflicts resolved.

@matthewrmshin (Contributor Author)

Branch rebased. Conflicts resolved.

[runtime]
[[foo]]
script = """
#rm "$CYLC_TASK_LOG_ROOT.status" # Disable polling.
sadielbartholomew (Collaborator) commented Apr 19, 2018

The #rm seems to have been unintentionally left in here from the development/debugging stages.

Contributor Author:

I have now removed the #rm command.

sadielbartholomew (Collaborator) left a review comment:

Sensible change which works as described & fixes the associated bug. Passes my local test battery & the new tests are suitable. A few minor comments which might be noteworthy (sorry, I meant to include the previous comment with these but pressed the wrong confirm button).

batch_sys_conf = {}
time_limit_delays = batch_sys_conf.get(
'execution time limit polling intervals', [60, 120, 420])
timeout = time_limit + sum(time_limit_delays)
Collaborator:

The exception catch on 908 will result in time_limit_delays being set to None, which will throw a TypeError in the sum() call, I believe.

Contributor Author:

No, if batch_sys_conf is an empty dict, time_limit_delays will be set to the default value [60, 120, 420] via the 2nd argument of the .get method.
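A minimal demonstration of the behaviour described: `dict.get()` returns its second argument when the key is absent, so an empty config dict still yields the default list rather than None, and `sum()` is safe.

```python
# dict.get(key, default) never returns None here when the key is missing;
# it returns the supplied default list instead.
batch_sys_conf = {}
time_limit_delays = batch_sys_conf.get(
    'execution time limit polling intervals', [60, 120, 420])
print(time_limit_delays)       # [60, 120, 420]
print(sum(time_limit_delays))  # 600 - no TypeError
```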

Collaborator:

Ah yes, my mistake!

delays.extend([delays[-1]] * size)
time_limit_delays[0] += time_limit - sum(delays)
delays += time_limit_delays
else: # if itask.state.status == TASK_STATUS_SUBMITTED:
Collaborator:

Consider adding an "i.e." or similar to this comment, as I went to check the nature of the TASK_STATUSES_ACTIVE set to work out whether it was meant to be there, thinking it might e.g. have been left in from the development stage.

Contributor Author:

The comment here is deliberate. It is saying that the else: statement:

else:  # if itask.state.status == TASK_STATUS_SUBMITTED:

is equivalent to:

elif itask.state.status == TASK_STATUS_SUBMITTED:

The bare else: is faster because it does not have to evaluate itask.state.status == TASK_STATUS_SUBMITTED.
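The idiom can be sketched in isolation. The names below are illustrative stand-ins, not cylc's actual code: when the status can only be one of two values at this point, a bare `else` with a comment documents the branch without re-evaluating the comparison.

```python
# Illustrative constants; in cylc these come from the task state module.
TASK_STATUS_RUNNING = 'running'
TASK_STATUS_SUBMITTED = 'submitted'

def choose_delays(status):
    if status == TASK_STATUS_RUNNING:
        return [60, 120]
    else:  # if status == TASK_STATUS_SUBMITTED:
        # Equivalent to `elif status == TASK_STATUS_SUBMITTED:` when
        # only those two statuses can reach this code.
        return [30]

print(choose_delays(TASK_STATUS_SUBMITTED))  # [30]
```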

Collaborator:

I understood this; my point is that it wasn't immediately obvious to me that this comment was implying equivalence (hence the suggestion "i.e."), as opposed to being something left there during development. Though it could easily just be me! Sorry if I was not clear.

Member:

I see your point @sadielbartholomew but I think we use this idiom fairly often (possibly omitting the "if" in the comment though)

Collaborator:

Apologies, it is clearly confusion arising from my lack of experience - when I see similar comments I will now know distinctly what they mean.

Member:

No apologies necessary! (And I wouldn't be too surprised if you see a few things we do that aren't clear or correct.)

# Set timeout
timeref = None # reference time, submitted or started time
timeout = None # timeout in setting
delays = [] # polling intervals
Collaborator:

As far as I can tell it is unnecessary to initialise delays as a list, given that the get_host_conf() method will always be called with default=[900] specified, which will be returned from that method if all else 'fails'.

message = 'health check settings: %s=%s' % (timeout_key, timeout_str)
# Attempt to group identical consecutive delays as N*DELAY,...
if itask.poll_timer.delays:
    items = []  # [(number of item - 1, item), ...]
Collaborator:

940:951 is nice! A clever way to process & report the delays.
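A sketch of that grouping technique (my own simplification, not cylc's actual implementation, which stores count-minus-one pairs): consecutive identical delays are collapsed so a schedule is reported compactly as N*DELAY,...

```python
# Run-length grouping of consecutive identical delays for compact logging.
def format_delays(delays):
    items = []  # [[count, delay], ...]
    for delay in delays:
        if items and items[-1][1] == delay:
            items[-1][0] += 1  # extend the current run
        else:
            items.append([1, delay])  # start a new run
    return ','.join(
        '%d*PT%dS' % (num, delay) if num > 1 else 'PT%dS' % delay
        for num, delay in items)

print(format_delays([30, 30, 30, 60]))  # 3*PT30S,PT60S
```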

@matthewrmshin (Contributor Author)

Branch squashed and rebased. Conflicts resolved.

hjoliver (Member) commented Apr 27, 2018
(I've started re-reviewing this) (note new conflicts).

matthewrmshin and others added 5 commits on April 27, 2018 at 10:14:

  • One variable to hold timeout and one variable to hold a poll timer. Populate timeout time and poll timer only when task is in appropriate state. Reset timeout variable after use. Combine execution polling schedule with execution time limit polling schedule.
  • Info on submission/execution timeout.
  • Info on polling schedule.
  • Info on issuing a poll, and estimated delay for next one.
  • Add link to `execution time limit polling intervals`.
hjoliver (Member) commented May 1, 2018

(mostly reviewed, I'm intending to finish it this evening)

hjoliver (Member) left a review comment:

No problems found, all good.

@hjoliver hjoliver merged commit 6e561b3 into cylc:master May 1, 2018
@matthewrmshin matthewrmshin deleted the poll branch May 1, 2018 10:40
Labels: bug
This pull request closes: Polling Logic Re-Write
4 participants