Pick worker with lowest memory use by percentage, not absolute #7266

Open
gjoseph92 opened this issue Nov 7, 2022 · 2 comments

@gjoseph92 (Collaborator)

Currently, the worker_objective function uses a worker's total managed memory (ws.nbytes) as a tiebreaker when it looks like a task will start at the same time on multiple workers:

return (start_time, ws.nbytes)

In a heterogeneous cluster, this means we might pick a small worker with little memory to spare over a large worker that holds more data in absolute terms but has far more memory available.

Maybe we should compare by percentage of memory used, rather than total bytes used:

diff --git a/distributed/scheduler.py b/distributed/scheduler.py
index eb5828bf..5325af4b 100644
--- a/distributed/scheduler.py
+++ b/distributed/scheduler.py
@@ -3233,7 +3233,7 @@ class SchedulerState:
         if ts.actor:
             return (len(ws.actors), start_time, ws.nbytes)
         else:
-            return (start_time, ws.nbytes)
+            return (start_time, ws.nbytes / ws.memory_limit)
 
     def add_replica(self, ts: TaskState, ws: WorkerState):
         """Note that a worker holds a replica of a task with state='memory'"""

#7248 does this for root tasks when queuing is enabled. I think it would make sense to do this in all cases, though.
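
To make the difference concrete, here's a minimal sketch with made-up numbers, using plain tuples as stand-ins for the real WorkerState fields:

# (start_time, managed bytes, memory_limit) -- made-up stand-ins for WorkerState.
small = (1.0, 2 * 2**30, 4 * 2**30)    # 2 GiB managed out of a 4 GiB limit
large = (1.0, 3 * 2**30, 16 * 2**30)   # 3 GiB managed out of a 16 GiB limit

def objective_absolute(w):
    start_time, nbytes, _limit = w
    return (start_time, nbytes)              # current tiebreaker

def objective_fraction(w):
    start_time, nbytes, limit = w
    return (start_time, nbytes / limit)      # proposed tiebreaker

# With equal start times, tuple comparison falls through to the memory term:
assert min([small, large], key=objective_absolute) is small   # fewer absolute bytes
assert min([small, large], key=objective_fraction) is large   # smaller fraction of its limit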

cc @fjetter @crusaderky

@crusaderky (Collaborator) commented Nov 8, 2022

This boils down to why people would use a heterogeneous cluster in the first place.
From my personal experience, the answers are chiefly two:

  1. some, but not all, hosts have specific resources (GPU, local services, etc.)
  2. some tasks have much higher heap, input, and/or output memory usage

In the second case, you'll likely have a wealth of tasks with average memory usage and no restrictions, plus a handful of tasks carrying a RAM or HI_MEM (or whatever you call it) resource restriction. In that use case, you do want to leave more memory free on the hosts with more RAM, because it will be needed by the resource-restricted tasks.
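
Concretely, that pattern looks something like this on the client side (the scheduler address, the HI_MEM resource name, and the task functions below are made-up placeholders; the large-memory hosts would be started with a matching flag, e.g. dask worker tcp://scheduler:8786 --resources "HI_MEM=1"):

from distributed import Client

client = Client("tcp://scheduler:8786")  # placeholder address

def process_chunk(i):   # hypothetical, ordinary task
    return i * 2

def combine(results):   # hypothetical, memory-hungry task
    return sum(results)

# The bulk of the tasks carry no restriction and can land on any worker...
futures = client.map(process_chunk, range(100))

# ...while the handful of memory-hungry tasks are pinned to workers that
# advertise the HI_MEM resource.
total = client.submit(combine, futures, resources={"HI_MEM": 1})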

FYI - AMM ReduceReplicas, graceful worker retirement, rebalance, and the future AMM Rebalance all use absolute optimistic memory as a metric (I'm fine with scheduling using managed memory instead, as it's a lot more responsive).

@gjoseph92 (Collaborator, Author)

That use case makes sense. However, I just find it hard to justify picking worker A in this situation (numbers worked through below):

  1. worker A: 400 MiB in memory, 300 MiB remaining
  2. worker B: 1 GiB in memory, 2 GiB remaining
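
Working through those numbers (so A has a ~700 MiB limit and B a ~3 GiB limit):

MiB, GiB = 2**20, 2**30

a_used, a_limit = 400 * MiB, 700 * MiB   # worker A
b_used, b_limit = 1 * GiB, 3 * GiB       # worker B

print(a_used < b_used)                       # True  -> absolute tiebreaker picks A
print(a_used / a_limit, b_used / b_limit)    # ~0.57 vs ~0.33 -> fractional picks B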

To me, using absolute memory is assuming too much that someone's use case and intent look like the one you've described. But you might just have heterogeneous workers because that's what you've got, whether it's the machines you had around in your lab, or the instances Coiled gave you because you allowed a range of instance types for faster cluster startup time.

I feel like the safest generic choice is the one that's least likely to put a worker under memory pressure. If someone has high-memory workers that they want to reserve for particular tasks, they can always use resource restrictions to accomplish that.

As a user, I'd be confused if my low-memory workers kept getting overloaded and dying but my high-memory workers stayed nearly empty, unless I'd explicitly used restrictions to make this happen.
