What are the use cases for Scheduler.handle_missing_data? #6445

Closed · fjetter opened this issue May 25, 2022 · 2 comments · Fixed by #6546
Labels: deadlock (The cluster appears to not make any progress) · discussion (Discussing a topic with no specific actions yet) · stability (Issue or feature related to cluster stability, e.g. deadlock)

Comments

@fjetter (Member) commented May 25, 2022

Introduction

The scheduler offers a handler, handle_missing_data, that is called from
exactly one place on the worker: during the response handling of gather_dep.
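
To make the call site concrete, here is a minimal, self-contained sketch of the worker-side logic. This is not the actual distributed implementation; the function name and the exact message fields are placeholders, loosely modelled on the missing-data stream message.

```python
# Hypothetical sketch of the worker-side call site, not the real distributed code.
# After a gather_dep round trip, any key that was requested but not returned by
# the remote worker is reported to the scheduler as a "missing-data" message.

def messages_after_gather_dep(requested, response, errant_worker):
    """Return the scheduler messages a worker would emit after gather_dep.

    `requested` is the set of keys we asked for; `response` maps key -> data
    for the keys the remote worker actually had.
    """
    missing = requested - set(response)
    return [
        {"op": "missing-data", "key": key, "errant_worker": errant_worker}
        for key in sorted(missing)
    ]


# Example: two keys requested, only one returned.
msgs = messages_after_gather_dep(
    requested={"x", "y"},
    response={"x": 123},
    errant_worker="tcp://10.0.0.5:40000",
)
assert msgs == [{"op": "missing-data", "key": "y", "errant_worker": "tcp://10.0.0.5:40000"}]
```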

This handling has been known to cause stability and deadlocking issues. In
recent refactoring attempts, the Worker.find_missing functionality has been
flagged as a source of complexity, and I'm wondering whether we should remove
this entire system in favor of a much simpler mechanism.

Looking at the scheduler side and reading the code and docstring, there are a
couple of different use cases in which this handler is supposed to be called.
It always assumes that the signal originates from a worker, called worker,
which reports that another worker, errant_worker, is not in possession of a key key.

The docstring explains that this can occur in two situations:

  1. worker is operating on stale data, i.e.
    worker.TaskState.who_has/Worker.has_what includes stale information, such
    that a Worker.get_data request to errant_worker returns a response
    without data for key.
  2. errant_worker is indeed dead, which worker likely infers from a
    network error it encountered.

If the scheduler already knows about the death of errant_worker, this request is
ignored.

Otherwise, the scheduler accepts this signal as the ultimate truth and removes
all state indicating that key was on errant_worker. There is some transition
logic in place to heal and/or raise the problem as appropriate.
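
A stripped-down sketch of that scheduler-side behaviour, as described above. This is illustrative only: plain dicts and sets stand in for the real Scheduler/WorkerState/TaskState objects, and `reschedule` is a placeholder for the transition machinery.

```python
# Illustrative sketch of the scheduler-side logic described above; plain data
# structures replace the real Scheduler state, and the transition call is a stub.

def handle_missing_data_sketch(state, key, errant_worker):
    # If the scheduler already knows about the death of errant_worker, ignore.
    if errant_worker not in state["workers"]:
        return

    # Otherwise accept the signal as the ultimate truth and remove all state
    # indicating that `key` was on `errant_worker`.
    state["workers"][errant_worker]["has_what"].discard(key)
    ts = state["tasks"][key]
    ts["who_has"].discard(errant_worker)

    # If no replica is left, transition logic would heal (recompute) or raise.
    if not ts["who_has"]:
        reschedule(state, key)


def reschedule(state, key):
    # Crude stand-in for the real transition system.
    state["tasks"][key]["state"] = "released"
```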

The circumstances under which a worker actually calls this handler are a bit
harder to describe. I am much more interested in whether this handler makes
sense on the scheduler side.


Investigating use case 1.) - The worker is operating on stale data. This can
happen for two reasons:

1.a) The data is indeed stale. Something happened that caused us to request
that key be forgotten on errant_worker, but we didn't tell worker.

I believe the only valid reason for this is AMM requesting worker to forget
the data. In all other instances where we properly release keys due to
transitions, worker would eventually be asked to release the key as well.

I would argue that this case is not something a worker should escalate to the
scheduler. If anything, the worker should ask the scheduler for the most recent,
accurate information and try again. Particularly with the changes proposed in
#6435, where the worker indeed removes no-longer-accurate entries, this use
case feels obsolete.

1.b) The scheduler has faulty information

I am struggling to come up with a situation where the scheduler is indeed
operating on false information, since workers are never allowed to forget data
without the scheduler telling them to. It is possible for workers to have more
data than the scheduler knows about (e.g. because messages were not yet
delivered), but it's hard to come up with a scenario where the worker lost data.
This could obviously happen if a worker lost its data due to external factors,
e.g. the disk is lost or a user deliberately removes keys from Worker.data. I
would argue that in this situation errant_worker should notice that an external
party is messing with its state and escalate to the scheduler itself, if we
even want to deal with this edge case.

2.) errant_worker is indeed dead. The scheduler would eventually notice
because Scheduler.stream_comms[errant_worker] would be aborted (at the latest
after distributed.comm.timeouts.tcp seconds).
The only reason why worker should ping the scheduler about this event is if we
want to accelerate recovery. From a consistency perspective, this should not be
required.

Conclusion

I am starting to believe that worker should simply remove errant_worker from
its internal bookkeeping and rely on the scheduler to eventually clean up. We
would accept that a worker.TaskState in state fetch is allowed to have an
empty who_has, and we'd rely on the scheduler to clean this up eventually.

The only functionality we'd need on Worker is a mechanism to eventually
refresh the worker's who_has/has_what information.

This could be a system, not directly connected to the state machine, that
ensures that all, or subsets of, the data locality information
(who_has/has_what) is eventually updated. This generic system would also
benefit busy tasks.
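
As a very rough sketch of what such a decoupled refresh loop could look like (entirely hypothetical: `scheduler_who_has` stands in for whatever RPC the worker would use to query the scheduler, and the polling interval is arbitrary):

```python
import asyncio


async def refresh_data_locality(worker_tasks, scheduler_who_has, interval=5.0):
    """Hypothetical background loop, decoupled from the worker state machine.

    `worker_tasks` maps key -> {"state": ..., "who_has": set()};
    `scheduler_who_has` is an async callable returning {key: [worker addresses]}.
    """
    while True:
        # Refresh locality info for everything we still want to fetch,
        # including tasks whose peers are currently busy.
        keys = [k for k, ts in worker_tasks.items() if ts["state"] == "fetch"]
        if keys:
            fresh = await scheduler_who_has(keys=keys)
            for key, addresses in fresh.items():
                worker_tasks[key]["who_has"] = set(addresses)
        await asyncio.sleep(interval)
```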

I believe this reduced coupling would significantly reduce complexity. I'm
wondering if I missed a valid use case.

cc @gjoseph92 @crusaderky

@crusaderky (Collaborator) commented:

I think there are two distinct use cases:

  1. the sender worker is alive and well, but responds to gather_dep that it does not have a particular key.
    I agree with you that missing-data should not be necessary in this case, as I cannot think of any way in which a release-worker-data message won't reach the scheduler soon enough.
  2. the sender worker is dead, and the recipient worker just happened to notice before the scheduler.
    I also agree that missing-data is inappropriate here. However, I wonder if the recipient should just send a missing-worker event instead, which would accelerate the last rites on the scheduler side? To clarify: this is a matter of performance, not consistency.

@fjetter (Member, Author) commented Jun 1, 2022

xref #6348
