
Gracefully shutdown by ... #8050

Open
mrocklin opened this issue Jul 28, 2023 · 18 comments

Comments

@mrocklin
Member

So, running on Coiled with Spot instances I'm seeing the following behavior:

  1. Get lots of spot instances

  2. Do lots of work

  3. In the middle of that work see machines start to go away.

    Dask handles this nicely, draining each machine's data onto its peers and then clearing them out swiftly, well within the two-minute window.

  4. More machines (pretty much all of them) go away to the point where we lose data.

  5. New on-demand machines come up thirty seconds later and restart the work

In this situation I wish that the spot machines had actually decided to hang around for as long as they thought was possible before offloading their data. We have two minutes. New machines take about a minute to show up. In many cases data transfer happens in 10s or so.

So, a proposed policy for graceful worker shutdown (a rough sketch follows the list):

  1. Accept a desired shutdown time as an input
  2. Stop work immediately
  3. Estimate how long it will take to drain the node of data. Look at average inter-worker bandwidth and at how much data we have on this machine.
  4. Wait until the last minute (or suitably before the last minute)
  5. Start transfers
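
As an illustration, a minimal sketch of what that policy could look like; every helper on worker below (stop_work, data_nbytes, measured_bandwidth, offload_data) is hypothetical and stands in for functionality that would need to be designed, not an existing distributed API:

import asyncio
import time

async def close_gracefully_by(worker, deadline: float, safety_margin: float = 15.0):
    # 1. `deadline` is the desired shutdown time (a unix timestamp) passed in
    #    by whatever received the spot interruption notice.
    # 2. Stop work immediately (hypothetical helper).
    worker.stop_work()

    # 3. Estimate how long draining will take from the amount of data on this
    #    machine and the average inter-worker bandwidth (hypothetical helpers).
    nbytes = worker.data_nbytes()
    bandwidth = max(worker.measured_bandwidth(), 1)  # bytes per second
    drain_estimate = nbytes / bandwidth

    # 4. Wait until the last safe moment before the deadline.
    start_at = deadline - drain_estimate - safety_margin
    await asyncio.sleep(max(0.0, start_at - time.time()))

    # 5. Start transfers: push this worker's unique data to its peers.
    await worker.offload_data()
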
@mrocklin
Member Author

cc @ntabris @scharlottej13

@mrocklin
Member Author

@fjetter from my perspective this is only a nice-to-have. No particular pressure from me. It also only works if we think we can reliably start workers in around 90s (which I think we usually can, but it's worth verifying).

@fjetter
Member

fjetter commented Jul 31, 2023

Not entirely opposed to this but there are a couple of challenges to overcome with the implementation that make this a non-trivial change.

Right now, Scheduler.retire_workers is protected by a Lock, so multiple retirement requests can only be processed sequentially. If we artificially delay the spot interruptions, we'd likely be in a very bad place if more than one instance is shut down at the same time. I guess we've just been lucky so far that the shutdown is so fast.

IIRC this lock was primarily introduced to guard against races with the old, deprecated Scheduler.replicate and Scheduler.rebalance. While we don't particularly care about those, it also made close_gracefully sequential. Removing this lock may therefore introduce a couple of other concurrency issues with retire_workers. I don't see anything obviously wrong with it, but the code is a bit convoluted.

So, either way we should ensure that retire_workers can be called concurrently...

There are a couple of options

  • Kill replicate + rebalance off entirely
  • Replace the lock with a bit smarter condition to make retire_workers mutually exclusive to replicate/rebalance but not to itself (see the sketch after this list)
  • Go all in and finally finish up Rebalance during a moving cluster #4906
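
As a sketch of the second option above (none of this is current Scheduler code), the plain lock could be replaced by a small shared/exclusive helper: retire_workers acquires it in shared mode so that multiple retirements can run concurrently, while replicate/rebalance acquire it exclusively:

import asyncio
from contextlib import asynccontextmanager

class SharedExclusiveLock:
    """Sketch only: shared holders (retire_workers) may run concurrently with
    each other but never alongside an exclusive holder (replicate/rebalance)."""

    def __init__(self):
        self._cond = asyncio.Condition()
        self._shared = 0
        self._exclusive = False

    @asynccontextmanager
    async def shared(self):
        async with self._cond:
            await self._cond.wait_for(lambda: not self._exclusive)
            self._shared += 1
        try:
            yield
        finally:
            async with self._cond:
                self._shared -= 1
                self._cond.notify_all()

    @asynccontextmanager
    async def exclusive(self):
        async with self._cond:
            await self._cond.wait_for(
                lambda: not self._exclusive and self._shared == 0
            )
            self._exclusive = True
        try:
            yield
        finally:
            async with self._cond:
                self._exclusive = False
                self._cond.notify_all()

Scheduler.retire_workers would then wrap its body in lock.shared() and replicate/rebalance in lock.exclusive(), at least until the latter two are removed entirely.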

I'm mentioning the full rebalance (#4906) on this list because it may actually be the objectively better variant anyhow when we're in a scale down + scale up situation, since I suspect that we'll end up with a skewed data distribution regardless of this timeout.

@mrocklin
Member Author

Right now, Scheduler.retire_workers is protected by a Lock, so multiple retirement requests can only be processed sequentially

Ah, that explains why I was seeing workers retire only one at a time. This seems to be a biggish deal with the Spot behavior we're observing. cc'ing @ntabris and @dchudz so that they have context on this limitation.

Probably a separate issue, but how hard would it be to batch some of these?

IIRC this lock was primarily introduced to guard against races with the old, deprecated Scheduler.replicate and Scheduler.rebalance. While we don't particularly care about those, it also made close_gracefully sequential. Removing this lock may therefore introduce a couple of other concurrency issues with retire_workers. I don't see anything obviously wrong with it, but the code is a bit convoluted.

Oh, I'd very happily nuke client-side rebalance/replicate in order to get better retire_workers functionality.

So, either way we should ensure that retire_workers can be called concurrently...

+1

@crusaderky
Collaborator

crusaderky commented Jul 31, 2023

So, a proposed policy for graceful worker shutdown

  1. Accept a desired shutdown time as an input
  2. Stop work immediately
  3. Estimate how long it will take to drain the node of data. Look at average inter-worker bandwidth and at how much data we have on this machine.
  4. Wait until the last minute (or suitably before the last minute)
  5. Start transfers

This should not be necessary.
Hosts will completely stop gathering data when they reach the pause threshold.
If they still die, it means that the delta between terminate and pause is too small (both thresholds are shown in the config sketch after this list) and can't accommodate:

  1. the heap of tasks that were already running at the moment of pausing, plus
  2. all gather_dep requests that were already in flight at the moment of pausing, plus
  3. one get_data request, which will likely cause data to unspill, plus
  4. general unmanaged memory, e.g. modules, fragmentation, etc.
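
For reference, the two thresholds discussed here are the standard distributed memory-management settings. A small illustration of widening the pause/terminate delta (the values are only an example, not a recommendation):

import dask

dask.config.set({
    # Fraction of the memory limit at which a worker stops executing tasks and
    # gathering data (default 0.80).
    "distributed.worker.memory.pause": 0.70,
    # Fraction at which the nanny terminates the worker (default 0.95).
    "distributed.worker.memory.terminate": 0.95,
})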

What does the memory look like just before a host dies? Is there any managed memory still there, or has it all been spilled away and it all looks unmanaged?

@crusaderky
Collaborator

Oh, I'd very happily nuke client-side rebalance/replicate in order to get better retire_workers functionality.

+1. When and if we reimplement them on top of AMM, retaining compatibility with the current API would be a colossal pain anyway.

@ntabris
Contributor

ntabris commented Jul 31, 2023

This should not be necessary.

I'm not following. The aim of the proposal is (1) to get more useful work out of the instance before it shuts down gracefully and (2) to have a better chance of having replacement instances running before graceful shutdown.

Are you saying there's already a way to accomplish that? Or that there's a better way to change things in order to accomplish them? Or that you disagree with those aims?

@crusaderky
Collaborator

The aim of this ticket is that healthy workers should not die due to OOM when spot instances are being retired.

Squeezing 2 extra minutes of work out of instances that will shut down soon is pointless IMHO. And this is the person who always obsesses about performance talking.

What I'm saying is that graceful worker retirement on A must not cause death-by-OOM on B, and that's the problem that must be solved.

@ntabris
Contributor

ntabris commented Jul 31, 2023

Squeezing 2 extra minutes of work out of instances that will shut down soon is pointless IMHO. And this is the person who always obsesses about performance talking.

What I'm saying is that graceful worker retirement on A must not cause death-by-OOM on B, and that's the problem that must be solved.

That certainly sounds like one problem to be solved.

Wouldn't it be extra nice if graceful shutdown could transfer data to fresh workers, though? Or are you thinking that doesn't matter in most cases?

@fjetter
Member

fjetter commented Jul 31, 2023

Wouldn't it be extra nice if graceful shutdown could transfer data to fresh workers, though? Or are you thinking that doesn't matter in most cases?

This is basically what happens if the new worker is already up. We're moving the data to the workers with the least data in memory.

@fjetter
Member

fjetter commented Jul 31, 2023

The "possibly death by OOM due to graceful downscaling" situation is something unrelated to this specific ticket, isn't it?

@ntabris
Contributor

ntabris commented Jul 31, 2023

This is basically what happens if the new worker is already up.

Yes, and part of the point (I thought) of delayed worker shutdown is to give the new worker a chance to show up.

@crusaderky
Collaborator

The "possibly death by OOM due to graceful downscaling" situation is something unrelated to this specific ticket, isn't it?

Is it? I thought it was the central point of the opening post. Or did I misread this?

More machines (pretty much all of them) go away to the point where we lose data.

@fjetter
Member

fjetter commented Aug 2, 2023

Is it? I thought it was the central point of the opening post. Or did I misread this?

I think you misread. It's not about not having sufficient memory but rather about not moving the data to a full/busy worker if there is a fresh one starting in a minute. This is about utilization, avoiding data skew, etc.

@consideRatio
Contributor

consideRatio commented Aug 22, 2023

I was interested in the various cloud providers' termination grace periods for spot/pre-emptible instances, and concluded that AWS gives 120 seconds, while GCP and Azure give 30 seconds.

AWS: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-instance-termination-notices.html
GCP: https://cloud.google.com/compute/docs/instances/spot
Azure: https://learn.microsoft.com/en-us/azure/virtual-machines/spot-vms

@jacobtomlinson
Member

I wonder if data could be spilled somewhere in a way that the worker can go away before a new one arrives. Maybe to a peer or to the scheduler, but not put back into active processing until there is a worker that can take it.

@crusaderky
Collaborator

crusaderky commented Aug 22, 2023

Is it? I thought it was the central point of the opening post. Or did I misread this?

I think you misread. It's not about not having sufficient memory but rather about not moving the data to a full/busy worker if there is a fresh one starting in a minute. This is about utilization, avoiding data skew, etc.

Then IMHO this is a Coiled issue, not a distributed one. Coiled should just delay the call to retire_workers.
In the ~1 minute window when the new worker is going up, I see little value in stopping new work from being sent to the dying worker.

Pseudocode in Coiled now:

async def aws_said_worker_will_die_in_3_minutes(addr):
    await asyncio.gather(
        scheduler.retire_worker(addr),
        spawn_new_worker(),
    )

New pseudocode in Coiled:

async def aws_said_worker_will_die_in_3_minutes(addr):
    try:
        # give the replacement up to 90s to come up before draining the doomed worker
        await asyncio.wait_for(spawn_new_worker(), timeout=90)
    finally:
        await scheduler.retire_worker(addr)

@crusaderky
Collaborator

I wonder if data could be spilled somewhere in a way that the worker can go away before a new one arrives. Maybe to a peer or to the scheduler, but not put back into active processing until there is a worker that can take it.

The current behaviour is to distribute all unique data to peer workers, preferring those with the lowest memory pressure.
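
For completeness, the same retirement path can be triggered explicitly from the client; a minimal example (the addresses are made up):

from distributed import Client

client = Client("tcp://scheduler-address:8786")  # illustrative address

# Gracefully retire one worker: its unique data is copied to the remaining
# workers (preferring those under the least memory pressure) before it closes.
client.retire_workers(workers=["tcp://10.0.0.5:39241"])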
