
Gracefully shutdown by ... #8050

Open
mrocklin opened this issue Jul 28, 2023 · 18 comments

Comments

@mrocklin
Member

So, running on Coiled with Spot instances I'm seeing the following behavior:

  1. Get lots of spot instances

  2. Do lots of work

  3. In the middle of that work see machines start to go away.

    Dask handles this nicely, draining each machine's data onto its peers and then clearing them out swiftly, well within the two-minute window.

  4. More machines (pretty much all of them) go away to the point where we lose data.

  5. New on-demand machines come up thirty seconds later and restart the work

In this situation I wish that the spot machines had actually decided to hang around for as long as they thought was possible before offloading their data. We have two minutes. New machines take about a minute to show up. In many cases data transfer happens in 10s or so.

So, a proposed policy for graceful worker shutdown (a rough sketch follows the list):

  1. Accept a desired shutdown time as an input
  2. Stop work immediately
  3. Estimate how long it will take to drain the node of data. Look at average inter-worker bandwidth and at how much data we have on this machine.
  4. Wait until the last minute (or suitably before the last minute)
  5. Start transfers
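
As an illustration, a minimal sketch of what that policy could look like; every helper on worker below (stop_work, data_nbytes, measured_bandwidth, offload_data) is hypothetical and stands in for functionality that would need to be designed, not an existing distributed API:

import asyncio
import time

async def close_gracefully_by(worker, deadline: float, safety_margin: float = 15.0):
    # 1. `deadline` is the desired shutdown time (a unix timestamp) passed in
    #    by whatever received the spot interruption notice.
    # 2. Stop work immediately (hypothetical helper).
    worker.stop_work()

    # 3. Estimate how long draining will take from the amount of data on this
    #    machine and the average inter-worker bandwidth (hypothetical helpers).
    nbytes = worker.data_nbytes()
    bandwidth = max(worker.measured_bandwidth(), 1)  # bytes per second
    drain_estimate = nbytes / bandwidth

    # 4. Wait until the last safe moment before the deadline.
    start_at = deadline - drain_estimate - safety_margin
    await asyncio.sleep(max(0.0, start_at - time.time()))

    # 5. Start transfers: push this worker's unique data to its peers.
    await worker.offload_data()
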
@mrocklin
Member Author

cc @ntabris @scharlottej13

@mrocklin
Member Author

@fjetter from my perspective this is only a nice-to-have. No particular pressure from me. It also only works if we think we can reliably start workers in around 90s (which I think we usually can, but it's worth verifying).

@fjetter
Member

fjetter commented Jul 31, 2023

Not entirely opposed to this but there are a couple of challenges to overcome with the implementation that make this a non-trivial change.

Right now, Scheduler.retire_workers is protected by a Lock, so multiple retirement requests can only be processed sequentially. If we artificially delay the spot interruptions, we'd likely be in a very bad place if more than one instance is shut down at the same time. I guess we've just been lucky so far that the shutdown is so fast.

IIRC this lock was primarily introduced to guard against races with the old, deprecated Scheduler.replicate and Scheduler.rebalance. While we don't particularly care about those, it also made close_gracefully sequential. Removing this lock may therefore introduce a couple of other concurrency issues with retire_workers. I don't see anything obviously wrong with it, but the code is a bit convoluted.

So, either way we should ensure that retire_workers can be called concurrently...

There are a couple of options

  • Kill replicate + rebalance off entirely
  • Replace the lock with a bit smarter condition to make retire_workers mutually exclusive to replicate/rebalance but not to itself (see the sketch after this list)
  • Go all in and finally finish up Rebalance during a moving cluster #4906
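
As a sketch of the second option above (none of this is current Scheduler code), the plain lock could be replaced by a small shared/exclusive helper: retire_workers acquires it in shared mode so that multiple retirements can run concurrently, while replicate/rebalance acquire it exclusively:

import asyncio
from contextlib import asynccontextmanager

class SharedExclusiveLock:
    """Sketch only: shared holders (retire_workers) may run concurrently with
    each other but never alongside an exclusive holder (replicate/rebalance)."""

    def __init__(self):
        self._cond = asyncio.Condition()
        self._shared = 0
        self._exclusive = False

    @asynccontextmanager
    async def shared(self):
        async with self._cond:
            await self._cond.wait_for(lambda: not self._exclusive)
            self._shared += 1
        try:
            yield
        finally:
            async with self._cond:
                self._shared -= 1
                self._cond.notify_all()

    @asynccontextmanager
    async def exclusive(self):
        async with self._cond:
            await self._cond.wait_for(
                lambda: not self._exclusive and self._shared == 0
            )
            self._exclusive = True
        try:
            yield
        finally:
            async with self._cond:
                self._exclusive = False
                self._cond.notify_all()

Scheduler.retire_workers would then wrap its body in lock.shared() and replicate/rebalance in lock.exclusive(), at least until the latter two are removed entirely.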

I'm mentioning the full rebalance (#4906) on this list because it may actually be the objectively better variant anyhow when we're in a scale down + scale up situation, since I suspect that we'll end up with a skewed data distribution regardless of this timeout.

@mrocklin
Member Author

Right now, Scheduler.retire_workers is protected by a Lock, so multiple retirement requests can only be processed sequentially

Ah, that explains why I was seeing workers retire only one at a time. This seems to be a biggish deal with the Spot behavior we're observing. cc'ing @ntabris and @dchudz so that they have context on this limitation.

Probably a separate issue, but how hard would it be to batch some of these?

IIRC this lock was primarily introduced to guard against races with the old, deprecated Scheduler.replicate and Scheduler.rebalance. While we don't particularly care about those, it also made close_gracefully sequential. Removing this lock may therefore introduce a couple of other concurrency issues with retire_workers. I don't see anything obviously wrong with it, but the code is a bit convoluted.

Oh, I'd very happily nuke client-side rebalance/replicate in order to get better retire_workers functionality.

So, either way we should ensure that retire_workers can be called concurrently...

+1

@crusaderky
Collaborator

crusaderky commented Jul 31, 2023

So, a proposed policy for graceful worker shutdown

  1. Accept a desired shutdown time as an input
  2. Stop work immediately
  3. Estimate how long it will take to drain the node of data. Look at average inter-worker bandwidth and at how much data we have on this machine.
  4. Wait until the last minute (or suitably before the last minute)
  5. Start transfers

This should not be necessary.
Hosts will completely stop gathering data when they reach the pause threshold.
If they still die, it means that the delta between terminate and pause is too small (both thresholds are shown in the config sketch after this list) and can't accommodate:

  1. the heap of tasks that were already running at the moment of pausing, plus
  2. all gather_dep requests that were already in flight at the moment of pausing, plus
  3. one get_data request, which will likely cause data to unspill, plus
  4. general unmanaged memory, e.g. modules, fragmentation, etc.
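
For reference, the two thresholds discussed here are the standard distributed memory-management settings. A small illustration of widening the pause/terminate delta (the values are only an example, not a recommendation):

import dask

dask.config.set({
    # Fraction of the memory limit at which a worker stops executing tasks and
    # gathering data (default 0.80).
    "distributed.worker.memory.pause": 0.70,
    # Fraction at which the nanny terminates the worker (default 0.95).
    "distributed.worker.memory.terminate": 0.95,
})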

What does the memory look like just before a host dies? Is there any managed memory still there, or has it all been spilled away and it all looks unmanaged?

@crusaderky
Collaborator

Oh, I'd very happily nuke client-side rebalance/replicate in order to get better retire_workers functionality.

+1. When and if we reimplement them on top of AMM, retaining compatibility with the current API would be a colossal pain anyway.

@ntabris
Contributor

ntabris commented Jul 31, 2023

This should not be necessary.

I'm not following. The aim of the proposal is (1) to get more useful work out of the instance before it shuts down gracefully and (2) to have a better chance of having replacement instances running before graceful shutdown.

Are you saying there's already a way to accomplish that? Or that there's a better way to change things in order to accomplish them? Or that you disagree with those aims?

@crusaderky
Collaborator

The aim of this ticket is that healthy workers should not die due to OOM when spot instances are being retired.

Squeezing 2 extra minutes of work out of instances that will shut down soon is pointless IMHO. And this is the person who always obsesses about performance talking.

What I'm saying is that graceful worker retirement on A must not cause death-by-OOM on B, and that's the problem that must be solved.

@ntabris
Contributor

ntabris commented Jul 31, 2023

Squeezing 2 extra minutes of work out of instances that will shut down soon is pointless IMHO. And this is the person who always obsesses about performance talking.

What I'm saying is that graceful worker retirement on A must not cause death-by-OOM on B, and that's the problem that must be solved.

That certainly sounds like one problem to be solved.

Wouldn't it be extra nice if graceful shutdown could transfer data to fresh workers, though? Or are you thinking that doesn't matter in most cases?

@fjetter
Member

fjetter commented Jul 31, 2023

Wouldn't it be extra nice if graceful shutdown could transfer data to fresh workers, though? Or are you thinking that doesn't matter in most cases?

This is basically what happens if the new worker is already up. We're moving the data to the workers with the least data in memory.

@fjetter
Member

fjetter commented Jul 31, 2023

The "possibly death by OOM due to graceful downscaling" situation is something unrelated to this specific ticket, isn't it?

@ntabris
Contributor

ntabris commented Jul 31, 2023

This is basically what happens if the new worker is already up.

Yes, and part of the point (I thought) of delayed worker shutdown is to give the new worker a chance to show up.

@crusaderky
Collaborator

The "possibly death by OOM due to graceful downscaling" situation is something unrelated to this specific ticket, isn't it?

Is it? I thought it was the central point of the opening post. Or did I misread this?

More machines (pretty much all of them) go away to the point where we lose data.

@fjetter
Member

fjetter commented Aug 2, 2023

Is it? I thought it was the central point of the opening post. Or did I misread this?

I think you misread. It's not about not having sufficient memory but rather about not moving the data to a full/busy worker if there is a fresh one starting in a minute. This is about utilization, avoiding data skew, etc.

@consideRatio
Contributor

consideRatio commented Aug 22, 2023

I was interested in the various cloud providers' termination grace periods for spot/pre-emptible instances, and concluded that AWS gives 120 seconds, while GCP and Azure give 30 seconds.

AWS: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-instance-termination-notices.html
GCP: https://cloud.google.com/compute/docs/instances/spot
Azure: https://learn.microsoft.com/en-us/azure/virtual-machines/spot-vms

@jacobtomlinson
Member

I wonder if data could be spilled somewhere in a way that the worker can go away before a new one arrives. Maybe to a peer or to the scheduler, but not put back into active processing until there is a worker that can take it.

@crusaderky
Collaborator

crusaderky commented Aug 22, 2023

Is it? I thought it was the central point of the opening post. Or did I misread this?

I think you misread. It's not about not having sufficient memory but rather about not moving the data to a full/busy worker if there is a fresh one starting in a minute. This is about utilization, avoiding data skew, etc.

Then IMHO this is a Coiled issue, not a distributed one. Coiled should just delay the call to retire_workers.
In the ~1 minute window when the new worker is going up, I see little value in stopping new work from being sent to the dying worker.

Pseudocode in Coiled now:

async def aws_said_worker_will_die_in_3_minutes(addr):
    await asyncio.gather(
        scheduler.retire_worker(addr),
        spawn_new_worker(),
    )

New pseudocode in Coiled:

async def aws_said_worker_will_die_in_3_minutes(addr):
    try:
        # give the replacement up to 90s to come up before draining the doomed worker
        await asyncio.wait_for(spawn_new_worker(), timeout=90)
    finally:
        await scheduler.retire_worker(addr)

@crusaderky
Collaborator

I wonder if data could be spilled somewhere in a way that the worker can go away before a new one arrives. Maybe to a peer or to the scheduler, but not put back into active processing until there is a worker that can take it.

The current behaviour is to distribute all unique data to peer workers, preferring those with the lowest memory pressure.
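
For completeness, the same retirement path can be triggered explicitly from the client; a minimal example (the addresses are made up):

from distributed import Client

client = Client("tcp://scheduler-address:8786")  # illustrative address

# Gracefully retire one worker: its unique data is copied to the remaining
# workers (preferring those under the least memory pressure) before it closes.
client.retire_workers(workers=["tcp://10.0.0.5:39241"])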
