-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add threadpool support to runtime #42302
Conversation
If we go this way, I think we need to remove |
|
b060dba
to
943d4fb
Compare
|
julia/stdlib/LinearAlgebra/src/LinearAlgebra.jl Lines 575 to 576 in e50603f
... julia/stdlib/LinearAlgebra/src/LinearAlgebra.jl Lines 589 to 591 in e50603f
so, I think the PR as-is breaks This is also the "intended" usage pattern of Lines 16 to 23 in e50603f
That said, the use of julia/stdlib/LinearAlgebra/src/matmul.jl Lines 778 to 780 in e50603f
Maybe it's better to just get rid of the buffer pooling until we have a proper solution (e.g., "singleton" initialization API using the double checked locking pattern). |
base/threadingconstructs.jl
Outdated
function spawn_threadpool(n::Int) | ||
wq = [Base.StickyWorkqueue() for _ in 1:n] | ||
push!(Base.AllWorkqueues, wq) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This push!
is racy since it could resize the internal buffer of AllWorkqueues
. We'd need a lock on AllWorkqueues
or a proper concurrent collection or just a CAS loop on a ref.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, other things in jl_start_threads_
(called from here) like jl_threadpools++
are also racy. I guess we also need to update jl_all_tls_states
here which is tricky (since GC can start anytime).
Actually, since the nthreads for this new pool needs to match in the Julia and C side, protecting AllWorkqueues
is not enough. Maybe it's easier to just use a lock here and making sure to acquire it before every time jl_start_threads_dedicated
is called. But synchronization of jl_all_tls_states
still needs something else...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think what we can do on the runtime side is to only (atomically) increment jl_threadpools
once all the runtime and Julia pieces are fully setup, and have all arrays be fully sized (to the max number of threads/threadpools we'll allow, which can be a build-time parameter) at init time. I also agree that jl_start_threads_dedicated
can take a lock to ensure serialization.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the fun(?) part is jl_all_tls_states
. I'd imagine something like this (in C)
new = copy(jl_all_tls_states)
new_ptls = [create_ptls() for _ in 1:n]
append!(new, new_ptls)
# No need for CAS since `jl_start_threads_dedicated` (or its caller)
# acquires the lock
@atomic jl_all_tls_states = new
# *Then* make the new ptls available atomically by setting `jl_n_threads`
@atomic jl_n_threads = length(new)
# Here, we still can't free `old` because there can be
# `jl_gc_wait_for_the_world` etc.
GC.gc(false)
# Once we run GC and stooped all existing thread, we can be sure there
# are no stale accesses to `jl_all_tls_states`
free(old)
...or maybe just leak old jl_all_tls_states
to simply things.
Alternatively, we can use a linked list (of blocks) for jl_all_tls_states
(but it maybe slows things down.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tkf now that threadpool spawning is done only at startup, would you mind re-reviewing? I'm not sure what kinds of atomics t->threadpool = xyz
needs (in the runtime).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I already raised concern in the "social" aspect in #41586 (comment) but from the technical standing point, the direction of this PR looks good to me.
I think the tricky parts that are not implemented yet are how to handle concurrent calls to jl_start_threads_dedicated
and also how to make it correct with respect to concurrent calls to other internals like jl_gc_wait_for_the_world
.
Also, maybe we need to make sure to call |
cc: @chriselrod |
If I understand correctly, this is what ThreadingUtilities/PolyesterWeave want. But I imagine over prescribing threads will still be a concern? |
This is a major concern for me as well. |
My reason for implementing this feature in the runtime is focused on I/O, not on computation. Only libraries which need these isolated threadpools to prevent deadlocks (like CUDA.jl/AMDGPU.jl for hostcalls, or Gtk.jl for its event loop) should use this feature by default; other libraries could allow this as an opt-in feature, but I would recommend against it. Another feature that we will need that should better suit alternative task scheduler implementations would be the ability to forward task scheduling decisions to a library instead of the built-in PARTR algorithm. |
Yes, I want to echo Julian and this feature should only be used for latency-critical (I/O) parts of the programs, and not throughput-oriented ones. I cannot emphasize enough that this PR relies heavily on the cooperation from the community to not use this feature as much as possible, especially compute-heavy ones. If every package uses its own thread pool, we'll lose composability. If a package uses its own thread pool, sure, it makes its micro-benchmark look awesome. But it does not mean it performs well when you combine multiple packages together. (But please don't take it as a direct criticism to Polyester.jl et al. I think it's great that we are working on multiple fronts to improve parallelism in Julia. But I think it'd be better to try converging to single resource management eventually so that multiple components in the ecosystem work well together.) |
I agree that "working well together" is preferable, but an independent thread pool (like BLAS currently uses) is preferable to potentially causing problems, and will be a much simpler adjustment to make. On this end:
I'd be happy to have some API for communicating with the PARTR scheduler. A fairly good and perhaps simple soltuion would be to have |
I think the philosophy behind Julia's parallelism design is that the BLAS model of managing computing resources at the library level doesn't work well when embedded in a large parallel program. I think beating BLAS using a composable way would be much more exciting than beating BLAS using the BLAS way. (Even though the latter is an excellent achievement on its own.) Julia's parallelism is meant to go beyond per-library static scheduling. |
There also has been the goal from the beginning to be able to replace BLAS thread pool and make it use the Julia tasks (as FFTW does) to enable composability and to avoid oversubscription. |
W.r.t BLAS: I thought the goal was to allow OpenBLAS to use our scheduler. Basically, we should push for libraries to have an API that allows passing them a scheduler, not passing over scheduling decisions to them. Edit: Jinx. |
Anyway, OpenBLAS's single and especially multithreaded performance are pretty bad, with a few exceptions. OpenBLAS's LU factorization is actively slower multithreaded than single threaded unless the matrices are very large on recent CPUs with lots of threads. And that is the case with OpenBLAS's current threading, which is lower overhead than Julia's. Thus, IMO, having BLAS compose with Julia's threading is a worse solution than |
Regardless of whether a given BLAS routine is maximally efficient in a multithreaded setup, trying to execute other code concurrently on the main threadpool with an external threadpool will harm the efficiency of both pools by cache thrashing. This is something that PARTR has the ability to avoid by co-scheduling related tasks together (such as the tasks comprising a BLAS call) and excluding unrelated tasks temporarily. If we're seeing evidence that PARTR isn't doing that job effectively for Julia code, we should fix PARTR, instead of giving up and using threadpools to make just one or two libraries work most efficiently in isolation. |
As I said before, because of the lack of efficiency of many BLAS/LAPACK routines, This is why my nearly ideal solution is to be able to query the PARTR scheduler if it's currently scheduling multiple tasks. If so, PolyesterWeave could avoid allocating any threads at all, allowing it to be multithreaded in otherwise serial execution, but single threaded in otherwise parallel execution -- following the assumption that the threading on the other level will be more efficient.
How many years will that take? However, I'm also not a fan of depth-first scheduling anyway and have low expectations. Between expecting bad results in a matter of years and satisfactory results that also give me control over performance in a time-frame of weeks at most (or immediately, given that I've already written ThreadingUtilties, etc), my choices are obvious. To be clear: But it would help libraries like RecursiveFactorization.jl, important for solving stiff ODEs, to have this autoswitching ability. |
Ok, we can probably make that happen, by exposing a C function that returns the size of the multiq heaps, or you might just be able to check
Probably on the order of a month or two, if we have a good MWE.
Care to explain why? |
That'd be great. julia> function workqueuelen(iter)
res = Matrix{Int}(undef, Threads.nthreads()+1, iter)
Threads.@threads for i ∈ 1:iter
for j ∈ 1:Threads.nthreads()
res[j,i] = length(Base.Workqueues[j])
end
res[Threads.nthreads()+1,i] = length(Base.Workqueue)
end
res
end
workqueuelen (generic function with 2 methods)
julia> wqls2 = workqueuelen(1_000_000);
julia> count(!iszero, wqls2)
575323
julia> sum(wqls2)
575323
julia> sum(wqls2) / length(wqls2)
0.019838724137931033 I.e., it is mostly
Yeah, if I had that available I could use it to avoid conflicts instead of threadpools and make everyone in this thread happy. ;) |
I think that's somewhat expected, since tasks can also be placed in the PARTR multiq for DFS.
Cool, then I'll put together a PR in the next week or so to expose multiq sizes. |
Querying the scheduler state is not a great API. It's racy. Also, in theory, it can increase cache traffic and move otherwise thread-local states out of exclusive mode (although the work queue we currently use probably isn't designed to get beneficial effects from this). More importantly, the user is now responsible for making sure the result is identical no matter what the scheduler state is. Using such an API is as hard as using Querying the global state is not the best mindset in concurrent programming, IMHO. Each component should act with mostly local knowledge with minimal synchronizations. The point is designing the system that self-organizes into efficient machinery when things are put together. |
will this PR also allow for changing the amount of threads in the first thread pool? This would be a great feature and also would prevent that people dynamically create new thread pools just because the size of the first one cannot be changed. You mentioned Gtk.jl as a use case. Would be very interesting if this could help resolving this issue: JuliaGraphics/Gtk.jl#503 |
No, that's intentionally a non-goal because it has the potential to break existing code that relies on
Yeah, this should be solvable by placing the Gtk event loop into its own threadpool, where it can execute for 100% of its thread's time slice if desired. |
ccfa09c
to
411077c
Compare
11637f1
to
f58cf21
Compare
Adds support for Julia to be started with `--threads=auto|N[,M]` where `N` specifies the number of threads in the default threadpool and `M`, if provided, specifies the number of threads in the new interactive threadpool. Adds an optional first parameter to `Threads.@spawn`: `[:default|:interactive]`. If `:interactive` is specified, the task will be run by thread(s) in the interactive threadpool only (if there is one). Co-authored-by: K Pamnany <kpamnany@users.noreply.github.com>
f58cf21
to
c7ea709
Compare
nt = UInt32(Threads._nthreads_in_pool(tpid)) | ||
if heap_c * nt <= heap_p | ||
return heap_p | ||
end | ||
|
||
heap_p += heap_c * nt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the wait-free property of multiqueue, we need nt = nthreads()
since all worker threads from all worker pools are possible enqueuers. We probably don't need to worry too much about this for now since most of enqueues happen inside each pool and a theoretical guarantee does not imply practical performance in general. But I think it's something to watch for when using it in practice especially when there are intensive inter-pool synchronizations. For example, a rather very conservative but wasteful approach is heap_p = max(heap_c * _nthreads_in_pool(tpid), nthreads() + 1)
.
@vtjnash Why is +=
used in heap_p += heap_c * nt
instead of =
by the way? We only need heap_p / nt > 1
here, right? The heap vector is either empy or fully initialized so I suppose it doesn't matter ATM though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a good point. In particular, to retain interactivity, it should be a common pattern for an interactive task to enqueue work to the default threadpool. However, as you say, it may not be a problem in practice, and in any case we have many problems with the use of multi-queues in general; IMO the way forward is an evolution of your #43366.
FWIW I find it unclear that #42302 (comment) is addressed before merge |
Ah. @jpsamaroo and @JeffBezanson and I agreed that |
Is there somewhere a high level description what this PR offers? I originally thought that this PR allows the user to create new thread pools during runtime (which also the title indicates) but browsing through the documentation it seems we now have two thread pools both of which need to be specified before initialization. We implement a low-latency high-throughput data acquisition platform and it would be highly desirable to spawn a new thread pool during runtime. In particular we do not want to share a pool between the UI (Gtk.jl) thread and the data acquisition threads (we have multiple). A different question is what the implication of this PR for Gtk.jl are. Should the UI thread be run in the interactive pool? |
There's a new manual entry. This is intended to be used for low-latency, high interactivity tasks. It is also opt-in by the user. |
I read that manual entry but still do not see how to create new threadpools at runtime. The PR started with:
and I just wonder how the new threadpools are created. |
There might be a confusion here: This PR adds threadpool support "to the runtime", the Julia runtime that is. AFAIU, it does not add features to dynamically add threads (or threadpools) "at runtime" (i.e. similar to (FWIW, I'd also like the ability to dynamically add and remove Julia threads at runtime. But I understand that this might be tricky.) |
The PR was originally targeting: |
There have been a lot of iterations over the design and implementation. What's ended up in the documentation reflects what's implemented. As @carstenbauer said,
is not implemented. The size of each worker pool is fixed at the startup time. That said, given various cleanup patches Jameson has been adding, I suspect the size of the pools becomes dynamic at some point.
Yes, GUI packages should use |
Thanks for the clarification. Did not want to criticize this PR but wanted to find out where this feature has been gone.
I really hope that this will be the case soon. Right now the workaround is to write
which is not very user friendly. In my data acquisition application the isolation property of a dedicated thread pool is even more important. But anyway, the |
Hm, so there are only two pools now? I kind of had hoped that we could use this for the communication part in the REPL process in the VS Code extension, so that we can create our own thread pool for that which is completely separate from any thread pool that user code would be using, but that doesn't seem feasible with this design, right? If we run our communication tasks on the Just curious: was there a specific reason to now limit this to two pools, instead of just making it a flexible system that can support n pools? |
@davidanthoff I'm going to try to recall the exact discussion around this, but suffice to say, we spent quite a lot of time discussing that particular feature possibility. So one of the issues we were concerned about when discussing the possibility of user-defined (or even dynamically-added) threadpools is that, by allowing arbitrary threadpools to exist, we put a large burden on the ecosystem to ensure that threadpool usage retains performance composability. What is performance composability, in this context? Performance composability is the ability to combine multiple pieces of code - each of which is performant on their own - and still retain similar magnitudes of performance in the combined code (or, if you're lucky, get better performance with the two combined). And how can it be that extra threadpools hurt this? Say library A uses the default threadpool, and library B uses its own dedicated threadpool, both doing performance-sensitive operations. By default, the default threadpool uses all system threads, and so at full-tilt it can fully and near-optimally utilize the system's CPU resources. However, this other threadpool is also likely using multiple threads, probably the same number as we have in the default threadpool, and so when code is also executing on it at the same time as code executes on the default threadpool, suddenly the OS starts de-prioritizing some of our threads because it can't execute all of them in parallel. What's the result? Performance of both codes suffers. Because you were doing an unrelated computation in one threadpool, computations in another threadpool lose performance, and vice versa. So, is this state of affairs permanent? Not at all! We discussed the possibility of exposing more built-in threadpools in the future, as necessity dictates, and also the possibility to revisit adding user-defined/dynamic threadpool support if we find that it's really necessary, and with the assumption that we have some plan to retain performance composability. For now, let's try out the interactive threadpool and see how things go! If you run into problems with meeting your interactivity deadlines, then we should have a discussion with the relevant maintainers to see if we can fix that. |
I understand all of that, but I think that analysis just misses the scenario we would like to enable in the VS Code extension. The scenario we have is that we want to run a tiny bit of background code in a process that is primarily running user code, and we want our code to run as robustly not matter what kind of silly things the user is doing in the REPL process (or any of the other scenarios like notebooks, testing etc that we support). If the user during development makes a mistake and runs some code on the interactive thread pool that completely blocks things, then that is exactly the situation where we still want our code to be preemptively scheduled. I think the worries about overscheduling really don't apply to our scenario. We would only want to add one extra thread and it would do very little work. Plus, the scenario we are in is interactive editing in VS Code. In any typical interactive editing session, there are probably hundreds (if not thousands) of other threads competing for time from the OS (in VS Code itself, web browsers, Spotify, who knows what else). It is hard to see how adding one more thread would make any difference. Even if it did, I think it would be the right trade-off in this scenario. I had a brief chat with @vtjnash at Juliacon, and he mentioned that he was working on a PR that adds the ability to dynamically add threads, and from what he described that would probably allow us to do what we would like to do, so this might already be in the works. |
I completely understand that; the concern is not that most libraries will do bad things, but that some libraries would do what I warned against, for the purpose of winning in benchmarks or because of some ideological disagreement with Julia's runtime system. While that is harmless in and of itself, if users blindly enable a large additional threadpool (because they saw a blog post showing how to get max speed for some benchmark code, or even because the package docs mention it), then we'll end up getting issues filed when performance drops for complicated codes which incidentally use that library. I think what's missing here is that we don't have a form of preemptive multitasking which would allow an interactive or semi-real-time task to forcibly steal execution time from some other actively-executing task.
Can you give a brief explanation of what that code will be doing? It might help inform the direction of this discussion if I can make more concrete recommendations.
These threads aren't all necessarily utilizing 100% of each CPU on the system; if they were, you'd probably get very poor performance for every thread in the system. Instead, many are doing I/O or infrequently doing small bits of work, which is exactly the idea for the kinds of tasks that should use the interactive threadpool.
The work Jameson mentioned is probably #45447 , which allows non-Julia threads to call into Julia code, but not necessarily to be able to create more standard Julia threads from Julia code. However, that might benefit your use case by allowing you to execute some non-yielding Julia code from a manually-allocated thread (which will require some |
NOTE: New PR description is below at #42302 (comment)