job polling broken for failed jobs after restart #4516

Closed
oliver-sanders opened this issue Nov 16, 2021 · 5 comments · Fixed by #5016

@oliver-sanders
Member

TL;DR:

Failed tasks can be polled back to incorrect states on restart.

Bug:

After a restart, Cylc updates the task proxies of submitted/running jobs with their owner@host pair so that the jobs can be polled:

if status in (TASK_STATUS_SUBMITTED, TASK_STATUS_RUNNING):
    itask.state.set_prerequisites_all_satisfied()
    # update the task proxy with user@host
    try:
        itask.task_owner, itask.task_host = user_at_host.split(
            "@", 1)
    except (AttributeError, ValueError):
        itask.task_owner = None
        itask.task_host = user_at_host

This, however, excludes succeeded and failed tasks. Consequently, after a restart, remote tasks in those states do not have their owner@host loaded from the DB, so any poll of them runs locally rather than on the remote host.

Polling will most likely fail but could also produce unexpected results (particularly for the case of background jobs).
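
To make the effect of the status guard concrete, here is a minimal, self-contained sketch. It is not Cylc code: `FakeTaskProxy`, `load_from_db`, `poll_target`, the status strings and the example `alice@hpc01` value are all hypothetical stand-ins, and it assumes that a task proxy with no host is polled on localhost.

```python
# Illustrative stand-ins only: these names mirror the snippet above but are
# NOT imported from Cylc.
TASK_STATUS_SUBMITTED = "submitted"
TASK_STATUS_RUNNING = "running"
TASK_STATUS_FAILED = "failed"


class FakeTaskProxy:
    """Hypothetical minimal task proxy; owner and host start unset."""

    def __init__(self):
        self.task_owner = None
        self.task_host = None


def load_from_db(status, user_at_host):
    """Mimic the restart guard: only submitted/running get owner@host."""
    itask = FakeTaskProxy()
    if status in (TASK_STATUS_SUBMITTED, TASK_STATUS_RUNNING):
        try:
            itask.task_owner, itask.task_host = user_at_host.split("@", 1)
        except (AttributeError, ValueError):
            # No "@" (ValueError on unpacking) or no value at all
            # (AttributeError on None): treat the whole value as the host.
            itask.task_owner = None
            itask.task_host = user_at_host
    return itask


def poll_target(itask):
    """Assumption: a proxy without a host is polled on localhost."""
    return itask.task_host or "localhost"


running = load_from_db(TASK_STATUS_RUNNING, "alice@hpc01")
failed = load_from_db(TASK_STATUS_FAILED, "alice@hpc01")
print(poll_target(running))  # hpc01
print(poll_target(failed))   # localhost
```

Under these assumptions the failed task loses its remote host on restart, which matches the behaviour described above.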

This may be related to #1792, which extended polling to succeeded/failed tasks but did not extend the owner@host update logic:

https://github.com/cylc/cylc-flow/pull/2396/files#diff-1f1aa9b850f9d1655a22322beb0e2d0604fb816b3bc807210120547f1a35ae24

When this is combined with a task that fails by hitting its execution time limit on a remote batch system (one that cannot be polled locally), the task gets polled back to running.

Reproducible Example:

[scheduling]
    [[dependencies]]
        graph = """
            a
            a:fail => restart
        """

[runtime]
    [[a]]
        script = """
            sleep 60
        """
        [[[remote]]]
            host = <host>
        [[[job]]]
            execution time limit = PT1S
            batch system = pbs
    [[restart]]
        script = """
            cylc stop "${CYLC_SUITE_NAME}" --now --now
            sleep 5
            cylc restart "${CYLC_SUITE_NAME}" --host=localhost
        """

Log Snippet (post-restart):

LOADING task proxies                                                                                     
+ a.1 failed    
+ restart.1 running    
LOADING task action timers    
+ a.1 [[u'job-logs-retrieve', u'failed'], 1]    
+ a.1 [u'try_timers', u'retrying']    
+ a.1 [u'try_timers', u'submit-retrying']    
+ restart.1 poll_timer    
+ restart.1 [u'try_timers', u'retrying']    
+ restart.1 [u'try_timers', u'submit-retrying']    
[a.1] status=failed: (polled)succeeded at 2021-11-16T10:15:19Z for job(01)           <= ERROR
[restart.1] status=running: (polled)succeeded at 2021-11-16T10:17:12Z for job(01)  

Pull requests welcome!
This is an Open Source project - please consider contributing a bug fix
yourself (please read CONTRIBUTING.md before starting any work though).

@oliver-sanders oliver-sanders added the bug Something is wrong :( label Nov 16, 2021
@oliver-sanders oliver-sanders added this to the cylc-7.8.x milestone Nov 16, 2021
@oliver-sanders
Member Author

I can't test this with Cylc 8 at the moment; however, I expect the bug is present there too.

@oliver-sanders
Member Author

The solution is presumably to update the owner@host for succeeded and failed tasks as well. We will need to check the logic to ensure this doesn't produce unexpected side effects in other parts of the code, e.g. host selection.
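
A rough sketch of that idea, under stated assumptions: `POLLABLE_STATUSES`, `split_user_at_host`, `restore_job_location` and the `alice@hpc01` value are hypothetical names invented for illustration, not Cylc's API and not necessarily what the eventual fix does. It only shows the owner@host restore being applied to every status that may later be polled, reusing the fallback from the snippet in the issue description.

```python
# Hypothetical names throughout; this is not the Cylc API or the merged fix.
POLLABLE_STATUSES = frozenset({"submitted", "running", "succeeded", "failed"})


def split_user_at_host(user_at_host):
    """Return (owner, host), falling back to (None, value) without an "@"."""
    try:
        owner, host = user_at_host.split("@", 1)
    except (AttributeError, ValueError):
        # Same fallback as the original snippet: whole value is the host.
        return None, user_at_host
    return owner, host


def restore_job_location(status, user_at_host):
    """Return (owner, host) for any pollable status, else (None, None)."""
    if status in POLLABLE_STATUSES:
        return split_user_at_host(user_at_host)
    return None, None


print(restore_job_location("failed", "alice@hpc01"))   # ('alice', 'hpc01')
print(restore_job_location("waiting", "alice@hpc01"))  # (None, None)
```

The sketch deliberately leaves out the `set_prerequisites_all_satisfied()` call from the original guard. Whether that call should also apply to succeeded and failed tasks is part of the side-effect check mentioned above.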

@oliver-sanders
Member Author

Cylc 8 issue - #4513

@dpmatthews
Contributor

> Cylc 8 issue - #4513

#4513 is really a different issue (polling doing the wrong thing).
This issue is about polling happening on the wrong platform.

I've confirmed that this remains an issue at Cylc 8.
I've had 2 recent reports of this problem so I think we really need to get it fixed in both Cylc 7 & 8.

@oliver-sanders
Member Author

Closed by #5016

@hjoliver hjoliver modified the milestones: cylc-7.8.x, 7.8.12 Sep 30, 2022