
Queuing: lowest-memory worker as tiebreaker #7248

Open · gjoseph92 wants to merge 5 commits into base: main

Conversation

gjoseph92
Collaborator

This could be another way of approaching #7197 in a practical sense. AFAIU the point of the round-robin behavior is that maybe, if you keep re-using the same worker, its memory will slowly creep up (if you're running a task that leaks memory or something)? So if we just use memory as a tie-breaker, you'd probably get round-robin in practice on an idle cluster.

Would need to add a simple test.

Closes #7197

  • Tests added / passed
  • Passes pre-commit run --all-files
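
To make the idea concrete, here is a minimal sketch of the selection rule being proposed. It is illustrative only, not the actual diff in this PR: pick_worker is a placeholder name, candidates is assumed to be the pool of equally eligible scheduler-side WorkerState objects, and the memory attribute shown is just one possible choice of measure (the thread below discusses which one to use).

def pick_worker(candidates):
    """Prefer the least-busy worker; break ties by lowest memory.

    On an idle cluster every candidate has nothing processing, so the
    memory tiebreaker alone decides. The chosen worker's memory then
    rises, so it loses the next tie, which approximates round-robin.
    """
    return min(
        candidates,
        key=lambda ws: (len(ws.processing), ws.memory.optimistic),
    )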

@github-actions
Contributor

github-actions bot commented Nov 3, 2022

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

  15 files ±0    15 suites ±0    6h 29m 15s ⏱️ +26s
   3 171 tests +3:    3 085 passed ✔️ +2,    83 skipped 💤 ±0,    3 failed +1
  23 464 runs +24:  22 544 passed ✔️ +8,   910 skipped 💤 +10,  10 failed +6

For more details on these failures, see this check.

Results for commit 6f0bbc5. ± Comparison against base commit 02b9430.

♻️ This comment has been updated with latest results.

@gjoseph92
Collaborator Author

I've tried to add a test here but I'm having a hard time getting it to work. (The test depends on #7221 BTW.)

@crusaderky I'm trying to use optimistic memory as the tiebreaker here, but it doesn't seem like the right metric. What do you think about this?

The problem is that when we've received task-finished from a worker, but not another heartbeat yet, optimistic memory may not include the nbytes we know are on the worker. So we will keep picking that worker (we picked it in the first place since its memory was the lowest) until the next heartbeat.

I think I should probably just change this to nbytes instead of trying to use the more complex memory measures—using high-latency metrics-based measures for scheduling doesn't feel quite right anyway.

I was just hoping to be able to make it so that in an idle cluster, if there's a worker with high unmanaged memory use, we wouldn't pick it, even if other workers had a small amount of managed memory. I think this would get at the original intent of #4637.

@crusaderky
Collaborator

crusaderky commented Nov 7, 2022

AFAIU the point of the round-robin behavior is that maybe, if you keep re-using the same worker, its memory will slowly creep up (if you're running a task that leaks memory or something)?

From what I can read, the round-robin algorithm prevents managed memory from piling up on a single worker.
e.g.

import numpy
from distributed import wait  # c is an existing distributed Client

futures = []
for i in range(100):
    fut = c.submit(numpy.random.random, 2**24, key=f"x{i}")  # ~128 MiB result
    wait(fut)
    futures.append(fut)

Without round-robin, all 100 tasks would complete on the same worker and stay there.

The problem is that when we've received task-finished from a worker, but not another heartbeat yet, optimistic memory may not include the nbytes we know are on the worker. So we will keep picking that worker (we picked it in the first place since its memory was the lowest) until the next heartbeat.

Correct, all memory measures except managed come from the heartbeat, and as such they suffer delays and are not a good choice for decisions that rely on split-second updates to the cluster state. When a task completes, managed will go up, but process will remain the same until the next heartbeat. This causes the size of the task that just completed to be temporarily subtracted from unmanaged_recent. This behaviour is correct for slow-ish tasks, and not so much for fast ones:

e.g. this task will produce very accurate readings on the scheduler:

import numpy
from time import sleep

def f():
    a = numpy.zeros(2**24)  # 128 MiB
    sleep(5)
    return a

  1. Between the first heartbeat after the task starts and the moment the scheduler receives {op: task-finished}, it is accounted as 128 MiB of unmanaged_recent.
  2. When the message arrives, it is accounted as 128 MiB of managed memory and 0 unmanaged_recent.
  3. When the heartbeat arrives several seconds later, it confirms what the scheduler already knows.

By contrast, a task that completes much faster than the heartbeat interval will produce misleading readings in the short term:

def f():
    return numpy.zeros(2**24)  # 128 MiB

  1. A heartbeat will likely never capture the task while it's running.
  2. When {op: task-finished} reaches the scheduler, the task is accounted as +128 MiB managed and -128 MiB unmanaged_recent (all subtotals are floored to 0).
  3. When the heartbeat finally arrives, it produces the correct numbers.
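
For concreteness, these measures can be inspected on the scheduler side. Below is a rough sketch on a local cluster; the attribute names (nbytes, memory.process, memory.unmanaged_recent, memory.optimistic) follow the discussion above and should be double-checked against the distributed version in use:

from distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=2)
client = Client(cluster)

# The LocalCluster scheduler runs in-process, so its per-worker state can be
# read directly (for illustration only).
for addr, ws in cluster.scheduler.workers.items():
    mem = ws.memory  # snapshot built from the last heartbeat plus known nbytes
    print(
        addr,
        ws.nbytes,             # managed bytes, updated as task-finished arrives
        mem.process,           # RSS as of the last heartbeat (delayed)
        mem.unmanaged_recent,  # unmanaged memory that appeared recently
        mem.optimistic,        # process minus unmanaged_recent
    )

client.close()
cluster.close()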

As you can see, there is no single measure that is right in all cases, and I don't think calling psutil every time a task finishes would be a good idea, performance-wise.

I think I should probably just change this to nbytes

This sounds like a good choice to me - better than the current round-robin, in fact.

@gjoseph92
Collaborator Author

Changed to just use ws.nbytes.

Tests will still fail until #7221.

test_decide_worker_memory_tiebreaker_idle_heterogeneous_cluster is skipped when queuing is disabled. I propose we discuss that separately in #7266.
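
Relative to the sketch earlier in this thread, the only thing that changes is the tiebreaker attribute (again illustrative, not the literal diff):

# ws.nbytes tracks the bytes of task outputs the scheduler knows are on the
# worker, so it updates as soon as task-finished is processed, with no
# heartbeat delay, unlike the heartbeat-based measures discussed above.
chosen = min(candidates, key=lambda ws: (len(ws.processing), ws.nbytes))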

gjoseph92 marked this pull request as ready for review November 7, 2022 19:11
@gjoseph92
Collaborator Author

Realized that #7221 accidentally slipped in here. With that removed, test_decide_worker_memory_tiebreaker_idle_cluster fails both here and on main, queuing on or off, because of #7274 and #7197 (comment).

gjoseph92 added a commit to gjoseph92/distributed that referenced this pull request Nov 10, 2022
Development

Successfully merging this pull request may close these issues:
  • Round-robin worker selection makes poor choices with worker-saturation > 1.0