job polling broken for failed jobs after restart #4516

oliver-sanders · 2021-11-16T10:22:51Z

TL;DR:

Failed tasks can be polled back to incorrect states on restart.

Bug:

After a restart, Cylc updates task proxies with the owner@host pair of submitted/running jobs to allow polling:

Lines 361 to 369 in 5ef4419

    if status in (TASK_STATUS_SUBMITTED, TASK_STATUS_RUNNING):
        itask.state.set_prerequisites_all_satisfied()
        # update the task proxy with user@host
        try:
            itask.task_owner, itask.task_host = user_at_host.split(
                "@", 1)
        except (AttributeError, ValueError):
            itask.task_owner = None
            itask.task_host = user_at_host

This, however, excludes succeeded and failed tasks. Consequently, following a restart, remote tasks in those states do not have their owner@host loaded from the DB, which causes polling to run locally.

Polling will most likely fail but could also produce unexpected results (particularly for the case of background jobs).

This may be related to #1792 which extended polling to succeeded / failed tasks but didn't extend the owner@host update logic:

https://github.com/cylc/cylc-flow/pull/2396/files#diff-1f1aa9b850f9d1655a22322beb0e2d0604fb816b3bc807210120547f1a35ae24

When this effect is combined with a task that fails by hitting its execution time limit on a remote batch system (one that is not pollable locally), the task is polled back to running.
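To make the failure mode concrete, here is a minimal, self-contained sketch of the restart behaviour described above. It is not Cylc code: `FakeTaskProxy`, `restore_on_restart` and `poll_target` are hypothetical stand-ins, the status strings and the `alice@hpc1` value are assumed for illustration, and the "no host means poll locally" rule is an assumption based on the description in this issue.

```python
from dataclasses import dataclass
from typing import Optional

# Stand-in status constants (assumed values, for illustration only).
TASK_STATUS_SUBMITTED = "submitted"
TASK_STATUS_RUNNING = "running"
TASK_STATUS_FAILED = "failed"


@dataclass
class FakeTaskProxy:
    """Hypothetical, stripped-down stand-in for a task proxy."""
    task_owner: Optional[str] = None
    task_host: Optional[str] = None


def restore_on_restart(status, user_at_host):
    """Mimic the quoted logic: only submitted/running jobs get owner@host."""
    itask = FakeTaskProxy()
    if status in (TASK_STATUS_SUBMITTED, TASK_STATUS_RUNNING):
        try:
            itask.task_owner, itask.task_host = user_at_host.split("@", 1)
        except (AttributeError, ValueError):
            # No "@" present (or no value at all): treat it as a bare host.
            itask.task_owner = None
            itask.task_host = user_at_host
    return itask


def poll_target(itask):
    """Assumption: with no host recorded, the poll runs on localhost."""
    return itask.task_host or "localhost"


running = restore_on_restart(TASK_STATUS_RUNNING, "alice@hpc1")
failed = restore_on_restart(TASK_STATUS_FAILED, "alice@hpc1")
print(poll_target(running))
print(poll_target(failed))
```

In this sketch the failed task is given no host even though the stored value names a remote one, so its poll target falls back to localhost, which is the situation the reproducible example below triggers.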

Reproducible Example:

[scheduling]
    [[dependencies]]
        graph = """
            a
            a:fail => restart
        """

[runtime]
    [[a]]
        script = """
            sleep 60
        """
        [[[remote]]]
            host = <host>
        [[[job]]]
            execution time limit = PT1S
            batch system = pbs
    [[restart]]
        script = """
            cylc stop "${CYLC_SUITE_NAME}" --now --now
            sleep 5
            cylc restart "${CYLC_SUITE_NAME}" --host=localhost
        """

Log Snippet (post-restart):

LOADING task proxies                                                                                     
+ a.1 failed    
+ restart.1 running    
LOADING task action timers    
+ a.1 [[u'job-logs-retrieve', u'failed'], 1]    
+ a.1 [u'try_timers', u'retrying']    
+ a.1 [u'try_timers', u'submit-retrying']    
+ restart.1 poll_timer    
+ restart.1 [u'try_timers', u'retrying']    
+ restart.1 [u'try_timers', u'submit-retrying']    
[a.1] status=failed: (polled)succeeded at 2021-11-16T10:15:19Z for job(01)           <= ERROR
[restart.1] status=running: (polled)succeeded at 2021-11-16T10:17:12Z for job(01)

Pull requests welcome!
This is an Open Source project - please consider contributing a bug fix
yourself (please read CONTRIBUTING.md before starting any work though).


oliver-sanders · 2021-11-16T10:26:25Z

I can't test this with Cylc 8 at the moment; however, I expect the bug is likely to be present there too.

oliver-sanders · 2021-11-16T14:56:45Z

The solution is presumably to update the owner@host for succeeded and failed tasks. Will need to check the logic to ensure this doesn't produce any unexpected side effects in other parts of the code e.g. host-selection.
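A rough sketch of that idea, assuming the only change needed is to widen the set of statuses that have owner@host restored. `POLLABLE_STATUSES`, `split_user_at_host` and `owner_host_for` are hypothetical names used for illustration, the status strings are assumed values, and none of this is taken from a merged fix. Any interaction with host selection would still need checking in the real code.

```python
from typing import Optional, Tuple

# Assumed status values; the set now also covers finished jobs.
POLLABLE_STATUSES = frozenset({"submitted", "running", "succeeded", "failed"})


def split_user_at_host(user_at_host) -> Tuple[Optional[str], Optional[str]]:
    """Split "owner@host"; a value without "@" is treated as a bare host."""
    try:
        owner, host = user_at_host.split("@", 1)
    except (AttributeError, ValueError):
        # AttributeError: value is not a string (e.g. None).
        # ValueError: no "@" present, so there is nothing to unpack.
        return None, user_at_host
    return owner, host


def owner_host_for(status, user_at_host):
    """Return (owner, host) for any status that may later be polled."""
    if status in POLLABLE_STATUSES:
        return split_user_at_host(user_at_host)
    return None, None


print(owner_host_for("failed", "alice@hpc1"))
print(owner_host_for("waiting", "alice@hpc1"))
```

Keeping the parsing in one small helper means the same fallback rules apply to every status, so widening the status set does not duplicate the try/except logic.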

oliver-sanders · 2021-11-29T14:41:55Z

Cylc 8 issue - #4513

dpmatthews · 2022-07-07T16:02:39Z

> Cylc 8 issue - #4513

#4513 is really a different issue (polling doing the wrong thing).
This issue is about polling happening on the wrong platform.

I've confirmed that this remains an issue at Cylc 8.
I've had 2 recent reports of this problem so I think we really need to get it fixed in both Cylc 7 & 8.

oliver-sanders · 2022-09-14T10:56:58Z

Closed by #5016

oliver-sanders added the bug ("Something is wrong :(") label Nov 16, 2021

oliver-sanders added this to the cylc-7.8.x milestone Nov 16, 2021

oliver-sanders mentioned this issue Nov 18, 2021: "actually load contact info before sending a task message." #4518 (Merged)

oliver-sanders mentioned this issue Nov 29, 2021: "Polling can incorrectly return a failed task to the running state" #4513 (Open)

hjoliver mentioned this issue Dec 7, 2021: "2021 Cylc Meetings" cylc/cylc-admin#139 (Closed)

hjoliver mentioned this issue Feb 10, 2022: "Only poll non-waiting tasks" #4658 (Merged)

wxtim self-assigned this Jul 26, 2022

wxtim mentioned this issue Jul 26, 2022: "add owner@host for SUCCEEDED and FAILED tasks" #5016 (Merged)

wxtim linked a pull request Jul 27, 2022 that will close this issue: "add owner@host for SUCCEEDED and FAILED tasks" #5016 (Merged)

wxtim mentioned this issue Aug 12, 2022: "job polling broken for failed jobs after restart" #5063 (Closed)

MetRonnie mentioned this issue Aug 12, 2022: "add platform on reload for SUCCEEDED and FAILED tasks" #5025 (Merged)

oliver-sanders closed this as completed Sep 14, 2022

hjoliver modified the milestones: cylc-7.8.x, 7.8.12 Sep 30, 2022
