
AMM can leave a forgotten task forever in missing state #6479

Closed
crusaderky opened this issue May 31, 2022 · 5 comments · Fixed by #6638
crusaderky commented May 31, 2022

  1. The scheduler sends acquire-replicas to the worker.
  2. Before the task can transition to flight, the task is forgotten on the scheduler. This in turn triggers a release-key command from the scheduler to all workers that hold the task in memory. This is unlike a compute-task request, where the workers that are fetching the key are tracked on the scheduler side and receive a release-key command too.
  3. The task eventually transitions fetch->flight->missing on the fetching worker.
  4. The task remains stuck forever in the missing state on the worker that was fetching it, and that worker pesters the scheduler every second with find_missing.

Reproducer:

import asyncio

# Test helpers from distributed's test suite (import locations assumed):
from distributed.utils_test import (
    freeze_data_fetching,
    gen_cluster,
    inc,
    wait_for_state,
)


@gen_cluster(client=True)
async def test_forget_acquire_replicas(c, s, a, b):
    """
    1. The scheduler sends acquire-replicas to the worker
    2. Before the task can transition to flight, the task is forgotten on the scheduler
       and on the peer workers *holding the replicas*.
       This is unlike a compute-task command, where the workers that are fetching
       the key are tracked on the scheduler side and receive a release-key command too.
    3. The task eventually transitions fetch->flight->missing
    4. Test that the task is eventually forgotten everywhere.
    """
    x = c.submit(inc, 2, key="x", workers=[a.address])
    await x
    with freeze_data_fetching(b, jump_start=True):
        s.request_acquire_replicas(b.address, ["x"], stimulus_id="one")
        await wait_for_state("x", "fetch", b)
        x.release()
        while "x" in s.tasks or "x" in a.tasks:
            await asyncio.sleep(0.01)

    # The key is forgotten everywhere else; b should eventually forget it too,
    # but instead it gets stuck in the `missing` state.
    while "x" in b.tasks:
        await asyncio.sleep(0.01)

The above test times out on the last line.

Proposed design

At the moment, the scheduler silently ignores missing keys in the request-refresh-who-has message.
The scheduler should instead respond stating "I don't know about this key, you should forget about it too". This would trigger the key to be forgotten on the worker too.
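A minimal sketch of what that reply could look like, using made-up function, message and field names ("refresh-who-has", "free-keys"); the real handler and message layout in distributed may differ. The point is that unknown keys get an explicit "forget these" list instead of being silently dropped:

def refresh_who_has_reply(known_replicas: dict, requested_keys: list) -> dict:
    who_has = {}  # key -> sorted list of worker addresses holding a replica
    stale = []    # keys the scheduler no longer knows about
    for key in requested_keys:
        holders = known_replicas.get(key)
        if holders is None:
            # The scheduler has forgotten the task: tell the worker to forget it
            # too, releasing a task that would otherwise be stuck in `missing`.
            stale.append(key)
        else:
            who_has[key] = sorted(holders)
    return {"op": "refresh-who-has", "who_has": who_has, "free-keys": stale}

# Example: "x" was forgotten on the scheduler while the worker was still fetching it.
reply = refresh_who_has_reply({"y": {"tcp://10.0.0.1:1234"}}, ["x", "y"])
assert reply["free-keys"] == ["x"]
assert reply["who_has"] == {"y": ["tcp://10.0.0.1:1234"]}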

Blockers

CC @fjetter

@crusaderky crusaderky self-assigned this May 31, 2022

crusaderky commented Jun 11, 2022

Same problem for tasks in flight:

import asyncio

# BlockedGatherDep is another test helper from distributed's test suite
# (import location assumed); it blocks inside gather_dep until released.
from distributed.utils_test import BlockedGatherDep, gen_cluster, inc


@gen_cluster(client=True, nthreads=[("", 1)], timeout=3)
async def test_forget_acquire_replicas_flight(c, s, a):
    """If a dependency fetch finishes on a worker after the scheduler already released
    everything, the worker might be stuck with a redundant replica which is never
    cleaned up.
    """
    async with BlockedGatherDep(s.address) as b:
        x = c.submit(inc, 1, key="x", workers=[a.address])
        await x

        s.request_acquire_replicas(b.address, ["x"], stimulus_id="test")
        await b.in_gather_dep.wait()
        assert b.tasks["x"].state == "flight"

        x.release()
        while "x" in s.tasks:
            await asyncio.sleep(0.01)

        b.block_gather_dep.set()
        while b.tasks:
            await asyncio.sleep(0.01)

XREF test_forget_data_not_supposed_to_have for tasks acquired through compute-task


fjetter commented Jun 15, 2022

"I don't know about this key, you should forget about it too". This would trigger the key to be forgotten on the worker too.

I remember having similar logic on the scheduler before, with "I don't know about this task, please forget it" being sent in various circumstances. This led to deadlocks due to race conditions.

Wouldn't a better fix instead be for the scheduler to remember who it asked to acquire a replica?
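
For comparison, a rough sketch of this alternative, with entirely made-up names and data structures (not distributed's actual scheduler state): the scheduler would track which workers it asked to fetch each key, so that forgetting the task can also notify the workers that are still fetching it.

class ReplicaRequestTracker:
    def __init__(self) -> None:
        # key -> addresses of workers that were asked to fetch it
        self.pending: dict = {}

    def on_acquire_replicas(self, worker: str, keys: list) -> None:
        for key in keys:
            self.pending.setdefault(key, set()).add(worker)

    def on_forget(self, key: str) -> list:
        # Messages the scheduler would send in addition to the release-key
        # commands it already sends to workers holding the key in memory.
        workers = self.pending.pop(key, set())
        return [{"op": "free-keys", "keys": [key], "worker": w} for w in sorted(workers)]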


crusaderky commented Jun 15, 2022

Everything goes through the same batched comms, sequentially.
Without involving RPC, I can't think of any way to create a race condition where the key is recreated on the scheduler and the scheduler sends an event to the worker about it before the deletion event is processed.

The RPC commands that can cause a race condition are:

[EDIT] see epic #6604
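
To illustrate the ordering argument above, a toy sketch using a plain asyncio.Queue as a stand-in for one scheduler->worker BatchedSend stream (op names are illustrative): because the deletion and any later recreation of the key travel over the same ordered channel, the worker can never observe them out of order.

import asyncio

async def main() -> None:
    channel: asyncio.Queue = asyncio.Queue()  # stand-in for one BatchedSend stream

    # Scheduler side: forget "x", then later ask for "x" again.
    await channel.put({"op": "free-keys", "keys": ["x"]})
    await channel.put({"op": "acquire-replicas", "who_has": {"x": ["tcp://..."]}})

    # Worker side: messages are handled strictly in the order they were sent.
    assert (await channel.get())["op"] == "free-keys"
    assert (await channel.get())["op"] == "acquire-replicas"

asyncio.run(main())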

Wouldn't a better fix instead be for the scheduler to remember who it asked to acquire a replica?

It's possible, but also unnecessarily complicated IMHO.
Related: #5759


crusaderky commented Jun 17, 2022

@fjetter are you happy for me to proceed with my design?


fjetter commented Jun 20, 2022

Yes, sounds good. I just wanted to highlight the possibility. If BatchedSend ordering protects us, that's good news 👍
