Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove release handler from scheduler #5091

Closed
wants to merge 1 commit into from

Conversation

fjetter
Copy link
Member

@fjetter fjetter commented Jul 19, 2021

This removes the redundant handler Scheduler.handle_release_data. This handler was only used by a single code path on the worker side which is triggered during a Worker.steal_request. There is a subtle race condition going on which looks like this

  • Bob receives steal_request with Alice as intended Thief
  • steal_request msg is queued up on scheduler / in flight / being
    handled * by Bob
  • Alice dies :(
  • Since Bob hasn't replied, yet, Scheduler thinks the task is lost and reschedules it, assigning it to Bob
  • Scheduler queues up msg for Bob to compute-task
  • In the meantime, Bob released the key and notifies the scheduler that it released the key
  • At this time, the scheduler thinks Bob is already processing the task when it runs this handler
  • This handler then causes the scheduler to transition the task back to released

This is not a critical race condition but it is not nice and I don't know if this couldn't escalate to something worse.

More importantly, this handler and stimulus include a lot of dead code. At some point this surely had a meaning but it was lost by now.

There are also seemingly duplicate implementations in Scheduler.handle_missing_data and Scheduler.release_worker_data which are still in use.

@fjetter fjetter force-pushed the remove_scheduler_release_handler branch from e9e42cb to 2931a4f Compare July 19, 2021 14:55
@mrocklin
Copy link
Member

Question, what should happen here? Should this be merged?

@fjetter
Copy link
Member Author

fjetter commented Jul 30, 2021

Question, what should happen here?

That's one of the mysteries I am trying to solve. This is all effectively untested and mostly code from what I understand. I would like to merge it and get rid of this. IMHO, if the removal causes problems, we should rebuild it with appropriate tests.

Should this be merged?

I think so, the missing key handling is one major source for instability and IMHO we should rewrite it anyhow. that's just the first piece

@mrocklin
Copy link
Member

This release or next? (next should, I think, be default)

@jrbourbeau
Copy link
Member

There's been a lot of recent work done on this portion of the codebase. Looking at the worker handlers on the scheduler today, we've removed "release"

worker_handlers = {
"task-finished": self.handle_task_finished,
"task-erred": self.handle_task_erred,
"release-worker-data": self.release_worker_data,
"add-keys": self.add_keys,
"missing-data": self.handle_missing_data,
"long-running": self.handle_long_running,
"reschedule": self.reschedule,
"keep-alive": lambda *args, **kwargs: None,
"log-event": self.log_worker_event,
"worker-status-change": self.handle_worker_status_change,
}

I'm going to close this PR as no longer needed, but feel free to re-open if that's not the case @fjetter

@jrbourbeau jrbourbeau closed this Sep 29, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants