work stealing seems to not occur fully when launching tasks from tasks #4945

Open
bolliger32 opened this issue Jun 21, 2021 · 2 comments

bolliger32 commented Jun 21, 2021

What happened:

I have a workflow in which long-running tasks are launched from within tasks, following the instructions for use of the worker_client context manager in the docs. In theory, according to the docs, each time the context manager is invoked, it will secede from the thread pool and then signal that the worker is able to accept another task. However, I've run into an issue where some workers will have no more tasks to process (i.e. they will only have these long-running, seceded tasks) while other workers will have a long backlog of "processing" tasks that are not actively being run but could be run on the now idle workers. When I say "long" I mean that it will take many minutes to run through those tasks, while shipping any data necessary to begin these tasks to another worker would take <1s. I mentioned this issue in a comment to #4906 but it was pointed out that this was not relevant to that issue. It's possible this situation has been brought up in another issue and if so, my apologies.

What you expected to happen:
In this situation, when working through the backlog on the busy worker would take much longer than sending any necessary dependencies over to the idle worker, I would expect some work stealing to happen and the idle worker to start picking up some of those tasks. I cannot find any reason in the work stealing docs why this shouldn't happen.
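To make that expectation concrete, here is a toy version of the comparison with made-up numbers (just a sketch of my reasoning, not the scheduler's actual heuristic):

# Rough cost comparison behind the expectation: stealing should pay off when the
# time to work through the busy worker's backlog far exceeds the cost of shipping
# a task's dependencies to the idle worker. Numbers here are illustrative only.
backlog_seconds = 50 * 0.25      # ~50 queued short tasks at ~0.25 s each
transfer_seconds = 1.0           # "<1s" to ship the needed data elsewhere
print(transfer_seconds < backlog_seconds)  # True, so stealing looks clearly worthwhile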

Minimal Complete Verifiable Example:

from dask.distributed import Client, worker_client
from time import sleep
import numpy as np

client = Client(n_workers=4, threads_per_worker=1)

def take_some_time(time):
    sleep(time)
    return time

def launcher(time):
    with worker_client() as client:
        return client.submit(take_some_time, time).result()

# 50 short, low-priority tasks that get spread across the workers
short_running_fut = client.map(take_some_time, np.linspace(0, .5, 50), priority=0)
# 3 long, higher-priority tasks that each submit more work from within the task
long_running_fut = client.map(launcher, [1000, 1001, 1002], priority=1)

In this example, the short-running take_some_time tasks are typically dispersed evenly across workers. Before they can complete, the higher-priority launcher tasks will typically get assigned to 3 different workers, which will then submit the long-running take_some_time tasks to 2 or 3 workers. Those workers will start running the long-running tasks and now have a backlog of short-running tasks in their queues. The 1 or 2 workers that are not currently running a long-running take_some_time task will spin through their shorter-running tasks and then sit idle.

We end up with an allocation like this:
[screenshot: dashboard task allocation showing Worker 1 with a long backlog of processing tasks and another worker with just 1 processing task]

where the worker with just 1 processing task is a worker that has a seceded launcher task but nothing else. In this situation, I would expect it to start stealing tasks from the backlog of Worker 1.

If I simply change launcher in the final line to take_some_time, then eventually all of the tasks get dumped onto the 1 worker that's not running a long-running task (as expected). If I then set the distributed.scheduler.work-stealing config setting to False, this does not occur (the short-running tasks stay piled up in the queues of the busy workers, while one worker stays idle).
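For reference, this is roughly how I toggle that setting (it has to be set before the Client/cluster is created; the same key can also go in a dask config YAML file or an environment variable):

import dask
from dask.distributed import Client

# Disable the scheduler's work stealing before the cluster is started
dask.config.set({"distributed.scheduler.work-stealing": False})

client = Client(n_workers=4, threads_per_worker=1)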

It appears that some work-stealing does occur when launching tasks from tasks. If I run the original code (i.e. mapping the launcher function rather than take_some_time in the last line) after having disabled work stealing, I get a distribution something like this:
[screenshot: dashboard task allocation after disabling work stealing]

which looks categorically different from what I was seeing originally. So it appears that launching tasks from tasks confuses the standard work-stealing algorithm to some degree, but does not disable it entirely. My guess is that, when evaluating work stealing, the scheduler sees the long-running seceded task as something actively being computed on that worker, rather than treating the worker as idle and ready for more work?
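A possible way to sanity-check that guess would be to peek at the scheduler's view of the workers. This is only a sketch: idle, workers, and occupancy are scheduler internals that I'm assuming exist in roughly this form, and they may differ between versions.

def scheduler_view(dask_scheduler):
    # Which workers the scheduler currently considers idle, plus its
    # estimate of queued work (occupancy, in seconds) per worker.
    return {
        "idle": [str(ws) for ws in dask_scheduler.idle],
        "occupancy": {
            addr: ws.occupancy for addr, ws in dask_scheduler.workers.items()
        },
    }

# `client` is the Client created in the example above
print(client.run_on_scheduler(scheduler_view))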

Anything else we need to know?:
It's entirely possible that I just don't understand work stealing correctly and there is a way to set this up so that I can avoid this behavior. If that's the case, I would be really grateful to hear it.

Environment:

  • Dask version: 2021.06.1
  • Python version: 3.8.10
  • Operating System: Linux
  • Install method (conda, pip, source): conda
bolliger32 changed the title from "work stealing seems to not occur when launching tasks from tasks" to "work stealing seems to not occur fully when launching tasks from tasks" on Jun 21, 2021

fjetter (Member) commented Jun 22, 2021

I had a look at the code and changed a few things to make it runnable (see below), but that didn't change much of the logic, just boilerplate.

What I observe is that the tasks properly secede and the worker is free to take on more work. I do receive CancelledErrors from the in-worker clients, which I haven't debugged yet, but the seceding seems to work fine and as intended. Do you observe the cancelled errors as well?

Regarding the work stealing, it might be connected to #4471 (PR #4920), which fixes an issue where we blacklist too-fast tasks. In your example, Dask executes take_some_time with zero sleep and learns that the task is ultra-fast, such that it is never considered worth stealing. Only upon execution of later keys does it learn that the tasks are slower than originally thought. However, we also always allow tasks without dependencies to be stolen, so I might need to dig deeper.

Can you please try the code snippet below and tell me whether it finishes as intended and shows the behaviour you are seeing?

from time import sleep

import numpy as np

from dask.distributed import Client, worker_client


def main():

    client = Client(n_workers=4, threads_per_worker=1)
    print(client.cluster.dashboard_link)

    def take_some_time(time):
        sleep(time)
        return time

    def launcher(time):
        with worker_client() as client:
            return client.submit(take_some_time, time).result()

    short_running_fut = client.map(take_some_time, np.linspace(0, 0.5, 50), priority=0)
    long_running_fut = client.map(launcher, [5 * 60] * 3, pure=False, priority=1)
    # gather both sets of futures (gather expects a single collection of futures)
    client.gather(short_running_fut + long_running_fut)


if __name__ == "__main__":
    main()
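
Regarding the too-fast blacklisting above, if you want to check what durations the scheduler has actually learned, something like the following should work. This is a rough sketch: task_prefixes and duration_average are scheduler internals, so the exact attribute names may differ between versions.

def learned_durations(dask_scheduler):
    # Average duration the scheduler has learned per task prefix; this estimate
    # feeds into whether a task is considered cheap enough to skip stealing.
    return {
        prefix: tp.duration_average
        for prefix, tp in dask_scheduler.task_prefixes.items()
    }

# `client` is any connected Client, e.g. the one inside main() in the reproducer above
print(client.run_on_scheduler(learned_durations))

If take_some_time shows up with a tiny learned average, that would support the blacklisting theory.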

bolliger32 (Author) commented

Thanks for taking a look @fjetter! Yes, I can confirm that the two snippets produce the same behavior (mine was just running in a notebook, but I confirmed that the one you posted runs as a script). Interestingly, I do not get CancelledErrors when running this on either a LocalCluster or a GatewayCluster with a Kubernetes backend.

On another note, it's also possible that my MRE was simplified too much from my actual observed situation. There are some other characteristics that might be important in the real example:

  1. None of the tasks that get held up in the real context (i.e. the jobs analogous to take_some_time) are quite as fast as the sleep(0) case that you mentioned as one reason that work doesn't get stolen. In reality, it's a couple of different tasks, some of which are faster and others are slower.
  2. The real tasks also do have dependencies.

That being said, the behavior is similar to what is observed in this example so I'm not sure these are the reasons that the work stealing is getting held up. Also, when I repeat the exercise with long_running_fut launching take_some_time directly, rather than through launcher, the work stealing occurs as expected.

Let me know if I can provide any more info, or if there are any debugging/diagnostic steps I could help out with. Is #4920 ready to test out? I could try installing from that branch and seeing what happens.
