orchestrator: Avoid restarting a task that has already been restarted #1305

Merged: 1 commit, Aug 3, 2016

Commits on Aug 3, 2016

  1. orchestrator: Avoid restarting a task that has already been restarted

    While debugging integration test failures a week ago, I dug deep
    into the recent restart supervisor changes and noticed a possible race
    that is almost impossible to hit in practice.
    
    This is the sequence of events that has to happen:
    
    1) Restart task A -> new task B.
    
    2) B fails before the restart delay has elapsed (pull failure):
    Restart(B) starts waiter goroutine.
    
    3) B's node fails at the exact moment the restart delay finishes. The
    orchestrator calls Restart(B) again; this time there is no delay in
    progress, so Restart proceeds to set B.DesiredState = SHUTDOWN and
    create a new task.
    
    4) The waiter goroutine from step 2 is scheduled. It calls Restart(B) to
    resume the first restart attempt. This sets B.DesiredState = SHUTDOWN
    (which means no change) and creates a new task.
    
    We'd end up with an extra task.
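
The four steps above can be replayed deterministically in a small sketch. Everything in it (the `Task` and `Cluster` types, `naiveRestart`, `replayRace`) is a hypothetical simplification for illustration, not the orchestrator's actual code; the restart delay and the waiter goroutine are represented only by the order of the calls.

```go
package main

import "fmt"

// TaskState is a simplified, hypothetical stand-in for the real task states.
type TaskState int

const (
	Running TaskState = iota
	Shutdown
)

// Task is a hypothetical, stripped-down task record.
type Task struct {
	ID           string
	DesiredState TaskState
}

// Cluster holds every task ever created in this toy model.
type Cluster struct {
	tasks []*Task
}

// naiveRestart shuts the task down and creates a replacement without
// checking whether the task was already restarted.
func (c *Cluster) naiveRestart(t *Task) *Task {
	t.DesiredState = Shutdown
	n := &Task{ID: fmt.Sprintf("task-%d", len(c.tasks)), DesiredState: Running}
	c.tasks = append(c.tasks, n)
	return n
}

// liveTasks counts tasks whose desired state is still Running.
func (c *Cluster) liveTasks() int {
	live := 0
	for _, t := range c.tasks {
		if t.DesiredState == Running {
			live++
		}
	}
	return live
}

// replayRace runs the sequence of events in order and returns the number
// of tasks that are meant to be running afterwards.
func replayRace() int {
	c := &Cluster{}
	a := &Task{ID: "A", DesiredState: Running}
	c.tasks = append(c.tasks, a)

	b := c.naiveRestart(a) // step 1: restart A, producing B
	// step 2: B fails during the delay; a waiter is parked (not modelled)
	c.naiveRestart(b) // step 3: node failure triggers Restart(B)
	c.naiveRestart(b) // step 4: the parked waiter resumes and calls Restart(B)
	return c.liveTasks()
}

func main() {
	fmt.Println("live tasks after race:", replayRace())
}
```

In this model the service wants one running task, but `replayRace` ends with two, which is the extra task described above.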
    
    Again, this would be almost impossible to trigger, but I'm fixing it for
    the sake of correctness. The general principle here is that Restart
    should never be called on a task that already has DesiredState >
    RUNNING, since what Restart does is "shut down this task and replace it
    with a new one".
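
A minimal sketch of that principle on the caller side, assuming a waiter that re-checks the task once the delay has elapsed. The names `alreadyReplaced` and `resumeDelayedRestart`, and the `restart` callback, are hypothetical and stand in for whatever the orchestrator actually uses.

```go
package main

import "fmt"

// TaskState is a simplified, hypothetical stand-in for the real task states.
type TaskState int

const (
	Running TaskState = iota
	Shutdown
)

// Task is a hypothetical, stripped-down task record.
type Task struct {
	ID           string
	DesiredState TaskState
}

// alreadyReplaced reports whether the task has already been asked to go
// past Running, meaning a replacement was (or is being) created for it.
func alreadyReplaced(t *Task) bool {
	return t.DesiredState > Running
}

// resumeDelayedRestart is what a waiter might call when the restart delay
// ends. It invokes restart only if nobody restarted the task in the
// meantime, and reports whether it did.
func resumeDelayedRestart(t *Task, restart func(*Task)) bool {
	if alreadyReplaced(t) {
		return false
	}
	restart(t)
	return true
}

func main() {
	created := 0
	restart := func(t *Task) {
		t.DesiredState = Shutdown
		created++ // stands in for creating the replacement task
	}

	b := &Task{ID: "B", DesiredState: Running}
	fmt.Println("node-failure path restarted:", resumeDelayedRestart(b, restart))
	fmt.Println("late waiter restarted:", resumeDelayedRestart(b, restart))
	fmt.Println("replacements created:", created)
}
```

With the check in place, the late waiter becomes a no-op and only one replacement is created for B.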
    
    I also added a sanity check to Restart. I'm not sure this is really
    helpful because returning an error probably does more harm than good in
    this case. But at least it would cause an error message to be logged.
    
    Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
    aaronlehmann committed Aug 3, 2016
    1f40cec