
Improve task polling and timeout handling #2593

Merged: 5 commits, May 1, 2018
Conversation

matthewrmshin (Contributor)

One variable to hold the timeout and one variable to hold a poll timer. Populate the timeout time and the poll timer only when the task is in an appropriate state. Reset the timeout variable after use. Combine the execution polling schedule with the execution time limit polling schedule.

Not quite a full rewrite, but should be sufficient to fix #2568.
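The schedule combination described above can be sketched roughly as follows. This is a hedged illustration only: `combine_delays` and its trimming step are my own simplification, not cylc's actual implementation.

```python
# Sketch: regular execution polls run up to the execution time limit,
# then the first time-limit poll is stretched so it lands exactly on
# the limit, followed by the remaining time-limit polling intervals.

def combine_delays(exec_delays, time_limit, time_limit_delays):
    """Return one combined list of poll delays (seconds)."""
    delays = list(exec_delays)
    # Drop regular polls that would fall past the time limit.
    while delays and sum(delays) > time_limit:
        delays.pop()
    time_limit_delays = list(time_limit_delays)
    # Shift the first time-limit poll so it fires at the time limit.
    time_limit_delays[0] += time_limit - sum(delays)
    return delays + time_limit_delays

print(combine_delays([60, 60], 150, [30, 30]))  # [60, 60, 60, 30]
```

With a 150 s limit and two 60 s execution polls, the first time-limit delay is stretched from 30 s to 60 s so the poll fires at the limit itself.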

@matthewrmshin matthewrmshin added the bug Something is wrong :( label Mar 2, 2018
@matthewrmshin matthewrmshin added this to the soon milestone Mar 2, 2018
@matthewrmshin matthewrmshin self-assigned this Mar 2, 2018
@matthewrmshin (Contributor Author)

matthewrmshin commented Mar 2, 2018

In the long run, I would quite like to separate the concepts of tasks and jobs (e.g. #2241 (comment)). With that, polling would become a job management function. I'd quite like to have a way to manage polling schedules using a centralised poller. A job would propose its polling schedule to the poller, but the poller would decide the best times - so it could bunch together related polls (e.g. task jobs on the same cluster) in its schedule.

@matthewrmshin (Contributor Author)

(Codacy failure will be fixed by #2567.)

@matthewrmshin (Contributor Author)

(Travis CI has passed but is reporting in progress here for some reason.)

hjoliver (Member) commented Mar 7, 2018

I've just received a bug report for a production system - the problem probably resulted from all post-timeout polls being done immediately, as described in #2568, instead of waiting on the configured delay intervals. It would be good to have a test to confirm that polling occurs after the configured intervals.
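The expected behaviour under discussion - each poll waits for its configured interval rather than firing immediately - can be sketched with an illustrative helper (not cylc code):

```python
# Each poll should be scheduled at the previous poll time plus its
# configured interval, never immediately after the timeout.

def poll_times(start, delays):
    """Return absolute poll times given a start time and delay intervals."""
    times = []
    now = start
    for delay in delays:
        now += delay
        times.append(now)
    return times

print(poll_times(0, [30, 60, 60]))  # [30, 90, 150]
```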

@matthewrmshin (Contributor Author)

@hjoliver Yes, I am working on turning a manual test into an automated one. Not quite there yet.

hjoliver (Member) commented Mar 7, 2018

Great, thanks!

hjoliver (Member) commented Mar 9, 2018

Activity of one task foo.1 - on this branch - killed externally, with execution time limit = PT30S and global.rc execution time limit polling intervals = PT30S,...:

2018-03-09T17:08:02+13 INFO - [foo.1] -submit-num=1, owner@host=localhost
2018-03-09T17:08:03+13 INFO - [foo.1] -(current:ready) submitted at 2018-03-09T17:08:02+13
2018-03-09T17:08:03+13 INFO - [foo.1] -job[01] submitted to localhost:background[18298]
2018-03-09T17:08:03+13 INFO - [foo.1] -(current:submitted)> started at 2018-03-09T17:08:02+13
2018-03-09T17:08:03+13 INFO - [foo.1] -next job poll in PT1M (after 2018-03-09T17:09:03+13)
2018-03-09T17:09:03+13 INFO - [foo.1] -next job poll in PT40S (after 2018-03-09T17:09:43+13)
2018-03-09T17:09:04+13 INFO - [foo.1] -(current:running) failed (polled)
2018-03-09T17:09:04+13 CRITICAL - [foo.1] -job(01) failed

This looks pretty good, but I wonder if we could have:

  • at job start, say "first job poll in PT1M ..." (not "next job poll")
  • say "polling job now at ..." when issuing the poll, to make it clearer that the poll result is not associated with the "next job poll ..." message printed just prior to it.

hjoliver (Member) commented Mar 9, 2018

We should also mention execution time limit polling intervals under execution time limit in the CUG global.rc reference.

@matthewrmshin (Contributor Author)

Modified the logging logic to:

  • On setup, print the submission/execution timeout and the full set of polling intervals.
  • On ready to poll, print "poll now, (next in ...)".

Example:

2018-03-09T15:11:09Z INFO - [t1.1] -health check settings: submission timeout=P1D, polling intervals=PT12S,...
2018-03-09T15:11:21Z INFO - [t1.1] -poll now, (next in PT12S (after 2018-03-09T15:11:33Z))

@matthewrmshin (Contributor Author)

(Rebased + conflict resolved. New test moved from tests/job-poll/03-* to tests/cylc-poll/16-*.)

@hjoliver (Member)

Damn, I forgot about this bug! To my mind, this one is worse than the other bugs fixed for the just-released 7.6.1. @matthewrmshin and @oliver-sanders - we should get this reviewed and merged quickly, if possible, and bang out 7.6.2.

@matthewrmshin (Contributor Author)

#2606 should make this one less lethal, but yes, I agree this fix should go in as soon as possible.

key1, submit_num = ctx_key
key = (key1, cycle, name, submit_num)
self.task_events_mgr.event_timers[key] = TaskActionTimer(
if ctx_key == "poll_timer":
Member:

ctx_key = "?"
    if ctx_key == "poll_timer":

Contributor Author:

Should now be fixed.

@sadielbartholomew (Collaborator)

(Adding myself as reviewer upon @matthewrmshin's request in person).

sadielbartholomew (Collaborator) commented Apr 16, 2018

Reviewing in progress. On hold until #2582 is reviewed as that is higher priority.

Still considering the core logical intention & code changes, but in the meantime I noticed that some of the tests could, in my opinion, be rewritten in a more explicit & adaptable way - see the side-PR referenced.

@matthewrmshin (Contributor Author)

Branch rebased. Conflicts resolved.

@matthewrmshin (Contributor Author)

Branch rebased. Conflicts resolved.

[runtime]
[[foo]]
script = """
#rm "$CYLC_TASK_LOG_ROOT.status" # Disable polling.
sadielbartholomew (Collaborator) commented Apr 19, 2018

The #rm seems to have been unintentionally left in here from the development/debugging stages.

Contributor Author:

I have now removed the #rm command.

sadielbartholomew (Collaborator) left a review comment:

Sensible change which works as described & fixes the associated bug. Passes my local test battery & the new tests are suitable. A few minor comments which might be noteworthy (sorry, I meant to include the previous comment with these but pressed the wrong confirm button).

batch_sys_conf = {}
time_limit_delays = batch_sys_conf.get(
'execution time limit polling intervals', [60, 120, 420])
timeout = time_limit + sum(time_limit_delays)
Collaborator:

The exception catch on 908 will result in time_limit_delays being set to None, which will throw a TypeError in the sum() call, I believe.

Contributor Author:

No, if batch_sys_conf is an empty dict, time_limit_delays will be set to the default value [60, 120, 420] via the 2nd argument of the .get method.
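A minimal demonstration of the behaviour described: `dict.get()` returns its second argument when the key is absent, so an empty config dict still yields the default list rather than None, and `sum()` is safe.

```python
# dict.get(key, default) never returns None here when the key is missing;
# it returns the supplied default list instead.
batch_sys_conf = {}
time_limit_delays = batch_sys_conf.get(
    'execution time limit polling intervals', [60, 120, 420])
print(time_limit_delays)       # [60, 120, 420]
print(sum(time_limit_delays))  # 600 - no TypeError
```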

Collaborator:

Ah yes, my mistake!

delays.extend([delays[-1]] * size)
time_limit_delays[0] += time_limit - sum(delays)
delays += time_limit_delays
else: # if itask.state.status == TASK_STATUS_SUBMITTED:
Collaborator:

Consider adding an "i.e." or similar to this comment, as I went to check the nature of the TASK_STATUSES_ACTIVE set to work out whether it was meant to be there, thinking it might e.g. have been left in from the development stage.

Contributor Author:

The comment here is deliberate. It is saying that the else: statement:

else:  # if itask.state.status == TASK_STATUS_SUBMITTED:

is equivalent to:

elif itask.state.status == TASK_STATUS_SUBMITTED:

The bare else: is faster because it does not have to evaluate itask.state.status == TASK_STATUS_SUBMITTED.
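The idiom can be sketched in isolation. The names below are illustrative stand-ins, not cylc's actual code: when the status can only be one of two values at this point, a bare `else` with a comment documents the branch without re-evaluating the comparison.

```python
# Illustrative constants; in cylc these come from the task state module.
TASK_STATUS_RUNNING = 'running'
TASK_STATUS_SUBMITTED = 'submitted'

def choose_delays(status):
    if status == TASK_STATUS_RUNNING:
        return [60, 120]
    else:  # if status == TASK_STATUS_SUBMITTED:
        # Equivalent to `elif status == TASK_STATUS_SUBMITTED:` when
        # only those two statuses can reach this code.
        return [30]

print(choose_delays(TASK_STATUS_SUBMITTED))  # [30]
```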

Collaborator:

I understood this; my point is that it wasn't immediately obvious to me that this comment was implying equivalence (hence the suggestion "i.e."), as opposed to being something left there during development. Though it could easily just be me! Sorry if I was not clear.

Member:

I see your point @sadielbartholomew but I think we use this idiom fairly often (possibly omitting the "if" in the comment though)

Collaborator:

Apologies, it is clearly confusion arising from my lack of experience - when I see similar comments I will now know distinctly what they mean.

Member:

No apologies necessary! (And I wouldn't be too surprised if you see a few things we do that aren't clear or correct.)

# Set timeout
timeref = None # reference time, submitted or started time
timeout = None # timeout in setting
delays = [] # polling intervals
Collaborator:

As far as I can tell it is unnecessary to initialise delays as a list, given that the get_host_conf() method will always be called with default=[900] specified, which will be returned from that method if all else 'fails'.

message = 'health check settings: %s=%s' % (timeout_key, timeout_str)
# Attempt to group identical consecutive delays as N*DELAY,...
if itask.poll_timer.delays:
    items = []  # [(number of item - 1, item), ...]
Collaborator:

940:951 is nice! A clever way to process & report the delays.
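A sketch of that grouping technique (my own simplification, not cylc's actual implementation, which stores count-minus-one pairs): consecutive identical delays are collapsed so a schedule is reported compactly as N*DELAY,...

```python
# Run-length grouping of consecutive identical delays for compact logging.
def format_delays(delays):
    items = []  # [[count, delay], ...]
    for delay in delays:
        if items and items[-1][1] == delay:
            items[-1][0] += 1  # extend the current run
        else:
            items.append([1, delay])  # start a new run
    return ','.join(
        '%d*PT%dS' % (num, delay) if num > 1 else 'PT%dS' % delay
        for num, delay in items)

print(format_delays([30, 30, 30, 60]))  # 3*PT30S,PT60S
```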

@matthewrmshin (Contributor Author)

Branch squashed and rebased. Conflicts resolved.

hjoliver (Member) commented Apr 27, 2018
(I've started re-reviewing this) (note new conflicts).

matthewrmshin and others added 5 commits on April 27, 2018 at 10:14:

  • One variable to hold timeout and one variable to hold a poll timer. Populate timeout time and poll timer only when task is in appropriate state. Reset timeout variable after use. Combine execution polling schedule with execution time limit polling schedule.
  • Info on submission/execution timeout.
  • Info on polling schedule.
  • Info on issuing a poll, and estimated delay for next one.
  • Add link to `execution time limit polling intervals`.
hjoliver (Member) commented May 1, 2018

(mostly reviewed, I'm intending to finish it this evening)

hjoliver (Member) left a review comment:

No problems found, all good.

@hjoliver hjoliver merged commit 6e561b3 into cylc:master May 1, 2018
@matthewrmshin matthewrmshin deleted the poll branch May 1, 2018 10:40
Labels: bug
This pull request closes: Polling Logic Re-Write
4 participants