
AMM can leave a forgotten task forever in missing state #6479

Closed
crusaderky opened this issue May 31, 2022 · 5 comments · Fixed by #6638
crusaderky commented May 31, 2022

  1. The scheduler sends acquire-replicas to the worker.
  2. Before the task can transition to flight, the task is forgotten on the scheduler. This in turn triggers a release-key command from the scheduler to all workers that hold the task in memory. This is unlike a compute-task request, where the workers that are fetching the key are tracked on the scheduler side and receive a release-key command too.
  3. The task eventually transitions fetch->flight->missing on the fetching worker.
  4. The task remains stuck forever in the missing state on the worker that was fetching it, and that worker pesters the scheduler every second with find_missing.

Reproducer:

import asyncio

# Test helpers from distributed's test suite (import locations assumed):
from distributed.utils_test import (
    freeze_data_fetching,
    gen_cluster,
    inc,
    wait_for_state,
)


@gen_cluster(client=True)
async def test_forget_acquire_replicas(c, s, a, b):
    """
    1. The scheduler sends acquire-replicas to the worker
    2. Before the task can transition to flight, the task is forgotten on the scheduler
       and on the peer workers *holding the replicas*.
       This is unlike a compute-task command, where the workers that are fetching
       the key are tracked on the scheduler side and receive a release-key command too.
    3. The task eventually transitions fetch->flight->missing
    4. Test that the task is eventually forgotten everywhere.
    """
    x = c.submit(inc, 2, key="x", workers=[a.address])
    await x
    with freeze_data_fetching(b, jump_start=True):
        s.request_acquire_replicas(b.address, ["x"], stimulus_id="one")
        await wait_for_state("x", "fetch", b)
        x.release()
        while "x" in s.tasks or "x" in a.tasks:
            await asyncio.sleep(0.01)

    # The key is forgotten everywhere else; b should eventually forget it too,
    # but instead it gets stuck in the `missing` state.
    while "x" in b.tasks:
        await asyncio.sleep(0.01)

The above test times out on the last line.

Proposed design

At the moment, the scheduler silently ignores missing keys in the request-refresh-who-has message.
The scheduler should instead respond stating "I don't know about this key, you should forget about it too". This would trigger the key to be forgotten on the worker too.
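A minimal sketch of what that reply could look like, using made-up function, message and field names ("refresh-who-has", "free-keys"); the real handler and message layout in distributed may differ. The point is that unknown keys get an explicit "forget these" list instead of being silently dropped:

def refresh_who_has_reply(known_replicas: dict, requested_keys: list) -> dict:
    who_has = {}  # key -> sorted list of worker addresses holding a replica
    stale = []    # keys the scheduler no longer knows about
    for key in requested_keys:
        holders = known_replicas.get(key)
        if holders is None:
            # The scheduler has forgotten the task: tell the worker to forget it
            # too, releasing a task that would otherwise be stuck in `missing`.
            stale.append(key)
        else:
            who_has[key] = sorted(holders)
    return {"op": "refresh-who-has", "who_has": who_has, "free-keys": stale}

# Example: "x" was forgotten on the scheduler while the worker was still fetching it.
reply = refresh_who_has_reply({"y": {"tcp://10.0.0.1:1234"}}, ["x", "y"])
assert reply["free-keys"] == ["x"]
assert reply["who_has"] == {"y": ["tcp://10.0.0.1:1234"]}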

Blockers

CC @fjetter

@crusaderky crusaderky self-assigned this May 31, 2022

crusaderky commented Jun 11, 2022

Same problem for tasks in flight:

import asyncio

# BlockedGatherDep is another test helper from distributed's test suite
# (import location assumed); it blocks inside gather_dep until released.
from distributed.utils_test import BlockedGatherDep, gen_cluster, inc


@gen_cluster(client=True, nthreads=[("", 1)], timeout=3)
async def test_forget_acquire_replicas_flight(c, s, a):
    """If a dependency fetch finishes on a worker after the scheduler already released
    everything, the worker might be stuck with a redundant replica which is never
    cleaned up.
    """
    async with BlockedGatherDep(s.address) as b:
        x = c.submit(inc, 1, key="x", workers=[a.address])
        await x

        s.request_acquire_replicas(b.address, ["x"], stimulus_id="test")
        await b.in_gather_dep.wait()
        assert b.tasks["x"].state == "flight"

        x.release()
        while "x" in s.tasks:
            await asyncio.sleep(0.01)

        b.block_gather_dep.set()
        while b.tasks:
            await asyncio.sleep(0.01)

XREF test_forget_data_not_supposed_to_have for tasks acquired through compute-task


fjetter commented Jun 15, 2022

"I don't know about this key, you should forget about it too". This would trigger the key to be forgotten on the worker too.

I remember having similar logic on the scheduler before, with "I don't know about this task, please forget it" being sent in various circumstances. This led to deadlocks due to race conditions.

Wouldn't a better fix instead be for the scheduler to remember who it asked to acquire a replica?
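
For comparison, a rough sketch of this alternative, with entirely made-up names and data structures (not distributed's actual scheduler state): the scheduler would track which workers it asked to fetch each key, so that forgetting the task can also notify the workers that are still fetching it.

class ReplicaRequestTracker:
    def __init__(self) -> None:
        # key -> addresses of workers that were asked to fetch it
        self.pending: dict = {}

    def on_acquire_replicas(self, worker: str, keys: list) -> None:
        for key in keys:
            self.pending.setdefault(key, set()).add(worker)

    def on_forget(self, key: str) -> list:
        # Messages the scheduler would send in addition to the release-key
        # commands it already sends to workers holding the key in memory.
        workers = self.pending.pop(key, set())
        return [{"op": "free-keys", "keys": [key], "worker": w} for w in sorted(workers)]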


crusaderky commented Jun 15, 2022

Everything goes through the same batched comms, sequentially.
Without involving RPC, I can't think of any way to create a race condition where the key is recreated on the scheduler and the scheduler sends an event to the worker about it before the deletion event is processed.

The RPC commands that can cause a race condition are:

[EDIT] see epic #6604
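
To illustrate the ordering argument above, a toy sketch using a plain asyncio.Queue as a stand-in for one scheduler->worker BatchedSend stream (op names are illustrative): because the deletion and any later recreation of the key travel over the same ordered channel, the worker can never observe them out of order.

import asyncio

async def main() -> None:
    channel: asyncio.Queue = asyncio.Queue()  # stand-in for one BatchedSend stream

    # Scheduler side: forget "x", then later ask for "x" again.
    await channel.put({"op": "free-keys", "keys": ["x"]})
    await channel.put({"op": "acquire-replicas", "who_has": {"x": ["tcp://..."]}})

    # Worker side: messages are handled strictly in the order they were sent.
    assert (await channel.get())["op"] == "free-keys"
    assert (await channel.get())["op"] == "acquire-replicas"

asyncio.run(main())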

Wouldn't a better fix instead be for the scheduler to remember who it asked to acquire a replica?

It's possible, but also unnecessarily complicated IMHO.
Related: #5759


crusaderky commented Jun 17, 2022

@fjetter are you happy for me to proceed with my design?


fjetter commented Jun 20, 2022

Yes, sounds good. I just wanted to highlight the possibility. If BatchedSend ordering protects us, that's good news 👍
