line_profiler results on 4 workers (w/o stealing) over 20 iterations #20
Taking a look at

Most of the relevant bits seem to be captured in this particular profile. Have embedded the text below for easier perusing :)

Edit: After more perusing, I think this is the only one that is relevant to us. Have pushed commit ( dask/distributed@3562816 ) to skip writing out trivial profiles, which should simplify things going forward.
Do you have an idea of what is slow in those functions?
No, I haven't profiled them yet. Though agree that would be the next step :)
OK, have gone ahead and profiled all the methods and functions called that took 10% or more of the time. Here are the results for the scheduler.
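The numbers in this thread come from line_profiler; for readers who want to reproduce this style of measurement without extra dependencies, the stdlib's cProfile gives per-function (rather than per-line) timings. Below is a minimal sketch with a toy function standing in for a scheduler hot path — none of the names are from dask itself:

```python
import cProfile
import io
import pstats

def hot_function(n):
    # Toy stand-in for a scheduler hot path.
    return sum(i * i for i in range(n))

profiler = cProfile.Profile()
profiler.enable()
result = hot_function(100_000)
profiler.disable()

# Sort by cumulative time and show the top entries, as one would when
# hunting for the methods that take 10% or more of the runtime.
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(5)
report = buf.getvalue()
```

line_profiler's kernprof then adds the per-line breakdown this thread relies on, at the cost of marking up the functions of interest.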
So with

In

Outside of those issues, a fair bit of time gets spent calling

With some of these changes merged into

Ideally we would want to benchmark
Just to summarize what we have been seeing in terms of major bottlenecks in the transition functions, here is a list of transition functions ordered from most time consumed to least. Functions that took more than 10% (usually much more than that) of a transition function's time are likewise listed from most consuming to least.
Would then add that these functions have the following bottlenecks:
These lack any specific bottleneck:
Summary: Hashing and comparisons cause notable slowdowns in the major bottlenecks, and also appear throughout due to

Edited: To reflect the profile below.
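The hashing/comparison overhead summarized above can be illustrated with a toy enum carrying a string-compatible `__eq__`, loosely modeled on (not copied from) the `Status` discussion in dask/distributed#4270: every `==` dispatches into a Python-level method, while `is` stays a single identity check.

```python
import enum
from timeit import timeit

class Status(enum.Enum):
    """Toy enum with a string-compat __eq__; illustrative only,
    not the real dask/distributed Status implementation."""
    running = "running"
    closed = "closed"

    def __eq__(self, other):
        # Convenient for comparing against plain strings, but it turns
        # every `==` on a Status into a Python-level call.
        if isinstance(other, str):
            return self.name == other
        return super().__eq__(other)

    def __hash__(self):
        # Defining __eq__ disables the inherited hash, so restore one.
        return hash(self.name)

s = Status.running

# `==` goes through __eq__; `is` is a single pointer comparison.
t_eq = timeit(lambda: s == Status.running, number=100_000)
t_is = timeit(lambda: s is Status.running, number=100_000)
print(f"==: {t_eq:.4f}s  is: {t_is:.4f}s")
```

This is why replacing hot-path `==` comparisons with `is` (or removing the custom `__eq__`) shows up so prominently in these profiles.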
With some of these changes merged into
I haven't looked deeply at these numbers yet. If you don't mind my being lazy, I'm broadly curious about two questions:
Sure. I've also made some updates to the summary above.
High level the
The work on
The
It would be good to understand what is going on in
Separately would add that when it comes to Cythonization (as it seems like we are getting closer to that stage), we might want to look at check_idle_saturated first. It's not the absolute slowest function, but it is pretty slow and does pop up in the two slowest transitions either directly or indirectly. As its code is pretty well contained and the work it does is well suited for C compilation (arithmetic, branching, etc.), expect this one to have the most notable effect without needing to write a lot of code. That said, this function is also pretty heavily affected by the __eq__ in Status ( dask/distributed#4270 ), spending 40% of its time there.

Other functions tend to use one of the State objects. So we probably need to start Cythonizing State objects before delving into them. This might be the sort of task that parallelizes well over a few devs (cc @quasiben) 😉
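To make the "well suited for C compilation" point concrete, here is a loose, hypothetical sketch of the *shape* of work a function like check_idle_saturated does — some arithmetic, branching, and set bookkeeping. The thresholds and names below are illustrative, not the real dask/distributed implementation:

```python
def classify_worker(occupancy, nthreads, avg_occupancy, idle, saturated, name):
    """Move `name` into the idle or saturated set based on its load.
    Hypothetical thresholds -- purely for illustration."""
    per_thread = occupancy / nthreads
    if per_thread < 0.5 * avg_occupancy:
        idle.add(name)
        saturated.discard(name)
    elif per_thread > 2 * avg_occupancy:
        saturated.add(name)
        idle.discard(name)
    else:
        idle.discard(name)
        saturated.discard(name)

idle, saturated = set(), set()
classify_worker(0.1, 4, 1.0, idle, saturated, "w1")   # lightly loaded
classify_worker(10.0, 1, 1.0, idle, saturated, "w2")  # heavily loaded
```

Division, comparisons, and set membership like this compile down to cheap C operations, which is why such a function is an attractive first Cythonization target.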
check_idle_saturated sounds like a great exploratory next step
Thanks again for writing up this analysis
As
Reran with
Briefly explored some light type annotations of
Also tried the same thing with
Additionally, Ben and I discussed earlier how we want to handle Cythonization in the benchmarks. Put together PR ( #33 ). Though a few follow-on PRs were needed ( #34 ) ( #35 ) ( #36 ), it seems like we have something working on that front now.
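For reference, the kind of "light type annotations" Cython supports in pure-Python mode looks roughly like the sketch below. The function is hypothetical, and the fallback class just lets the snippet run where Cython is not installed:

```python
try:
    import cython
except ImportError:
    # Minimal stand-in so the sketch also runs without Cython installed.
    class cython:  # type: ignore
        compiled = False
        int = int

        @staticmethod
        def locals(**_types):
            def deco(func):
                return func
            return deco

@cython.locals(total=cython.int, i=cython.int)
def occupancy_sum(values):
    # Uncompiled, this is ordinary Python; compiled with cythonize,
    # `total` and `i` become C ints and the loop avoids boxed arithmetic.
    total = 0
    for i in range(len(values)):
        total += values[i]
    return total
```

The appeal of this style is that the annotated module stays valid, testable Python while the compiled build picks up the C-level speedups.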
Have taken a pass at annotating one of the
Just to update this thread a bit, we have since gone through and annotated
Have started looking at how best to split the
Think most of the transitions with the exception of
Will update once we have a rough implementation of the refactored
Did a bit more annotation and optimization along the communication pathway with PR ( dask/distributed#4341 ). Though I think that might be as good as that particular path gets for the moment. As much of the communication code is higher-level Python, it is not particularly amenable to Cythonization. Instead, I think our best bet there is to defer communication until it is needed and try to handle as much of it together as possible.

In an attempt to both handle the communication problem mentioned above and make it easier to refactor the transition functions out into a separate Cython extension class for more thorough optimization, submitted PR ( dask/distributed#4343 ). This moves all communication to before and after the transition functions, as opposed to within them.
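The "move communication out of the transitions" idea can be sketched as follows; the message shapes and names here are illustrative, not the actual implementation in PR ( dask/distributed#4343 ):

```python
from collections import defaultdict

def transition_released_waiting(key, worker_msgs):
    """Hypothetical transition: record the outgoing message instead of
    sending it inline, so sends can happen after the transition loop."""
    worker_msgs["worker-1"].append({"op": "compute-task", "key": key})
    return "waiting"

def apply_transitions(keys):
    worker_msgs = defaultdict(list)
    states = {k: transition_released_waiting(k, worker_msgs) for k in keys}
    # After all transitions finish, each worker gets one batched send
    # instead of one send per transition.
    return states, dict(worker_msgs)

states, batched = apply_transitions(["x", "y", "z"])
```

Besides batching the network traffic, keeping I/O out of the transition bodies leaves them as pure state manipulation, which is exactly the part that is easy to move into a Cython extension class.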
Using the changes in PR ( dask/distributed#4265 ) and running the shuffle.py benchmark with 20 iterations as shown in PR ( #14 ) with 4 workers, here are the results (these were too large to include inline, so have attached them in a text file). Note that these results are taken from multiple processes, which we break out in the result file.

Edit: Should add that some methods may not be run in all cases. The profile notes this as no time being spent there. These can be ignored.