Add threadpool support to runtime #42302

jpsamaroo · 2021-09-18T18:42:07Z

NOTE: New PR description is below at #42302 (comment)

tkf · 2021-09-18T18:58:52Z

If we go this way, I think we need to remove Threads.resize_nthreads! (added as a todo). Luckily, it looks like it's only used in LinearAlgebra now (it's great that RNG is now task local).

jpsamaroo · 2021-09-18T19:30:09Z

nthreads() never changes size for a given threadpool, so it's probably fine as-is?

jpsamaroo · 2021-09-18T20:05:24Z

~~What's up with these Floating point exceptions on BK? I'm not getting them locally.~~ Whoops, I didn't initialize the barrier large enough to include the main thread.

tkf · 2021-09-18T20:08:18Z

LinearAlgebra.__init__ allocates nthreads buffers

julia/stdlib/LinearAlgebra/src/LinearAlgebra.jl

Lines 575 to 576 in e50603f

    
           function __init__() 
        
               try

...

julia/stdlib/LinearAlgebra/src/LinearAlgebra.jl

Lines 589 to 591 in e50603f

    
           Threads.resize_nthreads!(Abuf) 
        
           Threads.resize_nthreads!(Bbuf) 
        
           Threads.resize_nthreads!(Cbuf)

so, I think the PR as-is breaks * called on non-primary threadpool (provided that __init__ is called on the primary threadpool).

This is also the "intended" usage pattern of resize_nthreads!

julia/base/threads.jl

Lines 16 to 23 in e50603f

    
               resize_nthreads!(A, copyvalue=A[1]) 
        
           Resize the array `A` to length [`nthreads()`](@ref).   Any new 
        
           elements that are allocated are initialized to `deepcopy(copyvalue)`, 
        
           where `copyvalue` defaults to `A[1]`. 
        
           This is typically used to allocate per-thread variables, and 
        
           should be called in `__init__` if `A` is a global constant.

That said, the use of Abuf etc.is questionable anyway

julia/stdlib/LinearAlgebra/src/matmul.jl

Lines 778 to 780 in e50603f

    
           # FIXME: This code is completely invalid!!! 
        
           Atile = unsafe_wrap(Array, convert(Ptr{T}, pointer(Abuf[Threads.threadid()])), sz) 
        
           Btile = unsafe_wrap(Array, convert(Ptr{S}, pointer(Bbuf[Threads.threadid()])), sz)

Maybe it's better to just get rid of the buffer pooling until we have a proper solution (e.g., "singleton" initialization API using the double checked locking pattern).

tkf · 2021-09-18T20:27:01Z

base/threadingconstructs.jl

+function spawn_threadpool(n::Int)
+    wq = [Base.StickyWorkqueue() for _ in 1:n]
+    push!(Base.AllWorkqueues, wq)


This push! is racy since it could resize the internal buffer of AllWorkqueues. We'd need a lock on AllWorkqueues or a proper concurrent collection or just a CAS loop on a ref.

Also, other things in jl_start_threads_ (called from here) like jl_threadpools++ are also racy. I guess we also need to update jl_all_tls_states here which is tricky (since GC can start anytime).

Actually, since the nthreads for this new pool needs to match in the Julia and C side, protecting AllWorkqueues is not enough. Maybe it's easier to just use a lock here and making sure to acquire it before every time jl_start_threads_dedicated is called. But synchronization of jl_all_tls_states still needs something else...

I think what we can do on the runtime side is to only (atomically) increment jl_threadpools once all the runtime and Julia pieces are fully setup, and have all arrays be fully sized (to the max number of threads/threadpools we'll allow, which can be a build-time parameter) at init time. I also agree that jl_start_threads_dedicated can take a lock to ensure serialization.

I think the fun(?) part is jl_all_tls_states. I'd imagine something like this (in C)

new = copy(jl_all_tls_states) new_ptls = [create_ptls() for _ in 1:n] append!(new, new_ptls) # No need for CAS since `jl_start_threads_dedicated` (or its caller) # acquires the lock @atomic jl_all_tls_states = new # *Then* make the new ptls available atomically by setting `jl_n_threads` @atomic jl_n_threads = length(new) # Here, we still can't free `old` because there can be # `jl_gc_wait_for_the_world` etc. GC.gc(false) # Once we run GC and stooped all existing thread, we can be sure there # are no stale accesses to `jl_all_tls_states` free(old)

...or maybe just leak old jl_all_tls_states to simply things.

Alternatively, we can use a linked list (of blocks) for jl_all_tls_states (but it maybe slows things down.)

@tkf now that threadpool spawning is done only at startup, would you mind re-reviewing? I'm not sure what kinds of atomics t->threadpool = xyz needs (in the runtime).

base/threadingconstructs.jl

tkf

I already raised concern in the "social" aspect in #41586 (comment) but from the technical standing point, the direction of this PR looks good to me.

I think the tricky parts that are not implemented yet are how to handle concurrent calls to jl_start_threads_dedicated and also how to make it correct with respect to concurrent calls to other internals like jl_gc_wait_for_the_world.

tkf · 2021-09-18T21:10:29Z

Also, maybe we need to make sure to call __init__ always on the primary pool? A package Foo can spawn things in Foo.__init__. But it's probably not appropriate to spawn them in a non-primary pool even when @eval using Foo is invoked in the non-primary pool?

DilumAluthge · 2021-09-19T21:40:15Z

cc: @chriselrod

chriselrod · 2021-09-19T21:46:05Z

cc: @chriselrod

If I understand correctly, this is what ThreadingUtilities/PolyesterWeave want.
ThreadingUtilities will be able to create its own independent thread pool.
Other threaded code will not start tasks on any threads in this pool, removing the concern for potential conflicts.

But I imagine over prescribing threads will still be a concern?
You can of course also start Julia with julia -t1000 and then Threads.@threads will already over prescribe, but if a bunch of libraries start their own pools, it may be easier to accidentally end up with more threads than Sys.CPU_THREADS.

DilumAluthge · 2021-09-19T21:53:41Z

but if a bunch of libraries start their own pools, it may be easier to accidentally end up with more threads than Sys.CPU_THREADS

This is a major concern for me as well.

jpsamaroo · 2021-09-19T21:59:46Z

My reason for implementing this feature in the runtime is focused on I/O, not on computation. Only libraries which need these isolated threadpools to prevent deadlocks (like CUDA.jl/AMDGPU.jl for hostcalls, or Gtk.jl for its event loop) should use this feature by default; other libraries could allow this as an opt-in feature, but I would recommend against it.

Another feature that we will need that should better suit alternative task scheduler implementations would be the ability to forward task scheduling decisions to a library instead of the built-in PARTR algorithm.

tkf · 2021-09-19T22:34:33Z

Yes, I want to echo Julian and this feature should only be used for latency-critical (I/O) parts of the programs, and not throughput-oriented ones.

I cannot emphasize enough that this PR relies heavily on the cooperation from the community to not use this feature as much as possible, especially compute-heavy ones. If every package uses its own thread pool, we'll lose composability.

If a package uses its own thread pool, sure, it makes its micro-benchmark look awesome. But it does not mean it performs well when you combine multiple packages together. (But please don't take it as a direct criticism to Polyester.jl et al. I think it's great that we are working on multiple fronts to improve parallelism in Julia. But I think it'd be better to try converging to single resource management eventually so that multiple components in the ecosystem work well together.)

chriselrod · 2021-09-20T04:50:03Z

Yes, I want to echo Julian and this feature should only be used for latency-critical (I/O) parts of the programs, and not throughput-oriented ones.

I cannot emphasize enough that this PR relies heavily on the cooperation from the community to not use this feature as much as possible, especially compute-heavy ones. If every package uses its own thread pool, we'll lose composability.

If a package uses its own thread pool, sure, it makes its micro-benchmark look awesome. But it does not mean it performs well when you combine multiple packages together. (But please don't take it as a direct criticism to Polyester.jl et al. I think it's great that we are working on multiple fronts to improve parallelism in Julia. But I think it'd be better to try converging to single resource management eventually so that multiple components in the ecosystem work well together.)

I agree that "working well together" is preferable, but an independent thread pool (like BLAS currently uses) is preferable to potentially causing problems, and will be a much simpler adjustment to make.

On this end:

Another feature that we will need that should better suit alternative task scheduler implementations would be the ability to forward task scheduling decisions to a library instead of the built-in PARTR algorithm.

I'd be happy to have some API for communicating with the PARTR scheduler. A fairly good and perhaps simple soltuion would be to have PolyesterWeave simply avoid launching threads at all if there are multiple tasks scheduled by PARTR, with the assumption here being that there's a courser grained level of parallelism that will benefit more.
It's not perfect, as it could lead to under utilization if the parallelism is limited at that scope.

tkf · 2021-09-20T08:45:38Z

I think the philosophy behind Julia's parallelism design is that the BLAS model of managing computing resources at the library level doesn't work well when embedded in a large parallel program. I think beating BLAS using a composable way would be much more exciting than beating BLAS using the BLAS way. (Even though the latter is an excellent achievement on its own.)

Julia's parallelism is meant to go beyond per-library static scheduling.

vchuravy · 2021-09-20T10:47:48Z

There also has been the goal from the beginning to be able to replace BLAS thread pool and make it use the Julia tasks (as FFTW does) to enable composability and to avoid oversubscription.

KristofferC · 2021-09-20T10:48:04Z

W.r.t BLAS: I thought the goal was to allow OpenBLAS to use our scheduler. Basically, we should push for libraries to have an API that allows passing them a scheduler, not passing over scheduling decisions to them.

Edit: Jinx.

chriselrod · 2021-09-20T11:15:16Z

I think the philosophy behind Julia's parallelism design is that the BLAS model of managing computing resources at the library level doesn't work well when embedded in a large parallel program. I think beating BLAS using a composable way would be much more exciting than beating BLAS using the BLAS way. (Even though the latter is an excellent achievement on its own.)

Julia's parallelism is meant to go beyond per-library static scheduling.

There also has been the goal from the beginning to be able to replace BLAS thread pool and make it use the Julia tasks (as FFTW does) to enable composability and to avoid oversubscription.

W.r.t BLAS: I thought the goal was to allow OpenBLAS to use our scheduler. Basically, we should push for libraries to have an API that allows passing them a scheduler, not passing over scheduling decisions to them.

OpenMathLib/OpenBLAS#2255

Anyway, OpenBLAS's single and especially multithreaded performance are pretty bad, with a few exceptions.
So libraries interested in maximizing performance, like DifferentialEquations.jl, will want to continue to use faster Julia implementations like https://github.com/YingboMa/RecursiveFactorization.jl
Also, I would generally prefer LAPACK routines to be single threaded when nested in multithreaded programs / follow a breadth-first scheduling algorithm, as these have complicated dependencies such that the returns on number of threads is bad -- much worse than can easily be achieved from courser parallelism -- except at asymptotic sizes.

OpenBLAS's LU factorization is actively slower multithreaded than single threaded unless the matrices are very large on recent CPUs with lots of threads. And that is the case with OpenBLAS's current threading, which is lower overhead than Julia's.
RecursiveFactorization.jl is able to actually achieve some speedup from multiple threads instead of slowing down (on top of being faster single threaded than OpenBLAS), but without an algorithmic change that speed up is going to be small enough to not be worthwhile if those threads can be doing anything else.
Even if we achieve MKL level benefits from multithreading via an algorithmic change, the threads would probably be better used on other "outer" tasks, so that disabling threads is probably a good heuristic solution for algorithms like this in general.

Thus, IMO, having BLAS compose with Julia's threading is a worse solution than BLAS.set_num_threads(1).
A better solution is to automatically use a single thread then other mutltihreading is engaged.

jpsamaroo · 2021-09-20T13:38:29Z

Regardless of whether a given BLAS routine is maximally efficient in a multithreaded setup, trying to execute other code concurrently on the main threadpool with an external threadpool will harm the efficiency of both pools by cache thrashing. This is something that PARTR has the ability to avoid by co-scheduling related tasks together (such as the tasks comprising a BLAS call) and excluding unrelated tasks temporarily. If we're seeing evidence that PARTR isn't doing that job effectively for Julia code, we should fix PARTR, instead of giving up and using threadpools to make just one or two libraries work most efficiently in isolation.

chriselrod · 2021-09-20T13:47:34Z

Regardless of whether a given BLAS routine is maximally efficient in a multithreaded setup, trying to execute other code concurrently on the main threadpool with an external threadpool will harm the efficiency of both pools by cache thrashing. This is something that PARTR has the ability to avoid by co-scheduling related tasks together (such as the tasks comprising a BLAS call) and excluding unrelated tasks temporarily.

As I said before, because of the lack of efficiency of many BLAS/LAPACK routines, BLAS.set_num_threads(1) to avoid thrashing will generally be faster than "co-scheduling": the non-BLAS threading will be more efficient, and thus the parallelism at that level should be maximized and the threading at the BLAS level minimized to optimize performance.

This is why my nearly ideal solution is to be able to query the PARTR scheduler if it's currently scheduling multiple tasks. If so, PolyesterWeave could avoid allocating any threads at all, allowing it to be multithreaded in otherwise serial execution, but single threaded in otherwise parallel execution -- following the assumption that the threading on the other level will be more efficient.

If we're seeing evidence that PARTR isn't doing that job effectively for Julia code, we should fix PARTR, instead of giving up and using threadpools to make just one or two libraries work most efficiently in isolation.

How many years will that take?

However, I'm also not a fan of depth-first scheduling anyway and have low expectations.

Between expecting bad results in a matter of years and satisfactory results that also give me control over performance in a time-frame of weeks at most (or immediately, given that I've already written ThreadingUtilties, etc), my choices are obvious.

To be clear:
"In production", I'm use ThreadsX and Threads.@threads more often than I use Polyester or @tturbo.
Because I do have control of the full code stack, I "manually" implement what I want here by simply not using threading at all.
That code also always has only 1 layer of threading.

But it would help libraries like RecursiveFactorization.jl, important for solving stiff ODEs, to have this autoswitching ability.

jpsamaroo · 2021-09-20T14:04:27Z

This is why my nearly ideal solution is to be able to query the PARTR scheduler if it's currently scheduling multiple tasks.

Ok, we can probably make that happen, by exposing a C function that returns the size of the multiq heaps, or you might just be able to check Base.Workqueues. If you had that available, would you still plan to use threadpools?

How many years will that take?

Probably on the order of a month or two, if we have a good MWE.

However, I'm also not a fan of depth-first scheduling anyway and have low expectations.

Care to explain why?

chriselrod · 2021-09-20T15:03:15Z

Ok, we can probably make that happen, by exposing a C function that returns the size of the multiq heaps, or you might just be able to check Base.Workqueues.

That'd be great.
How can I check Base.Workques? length doesn't seem right:

julia> function workqueuelen(iter)
           res = Matrix{Int}(undef, Threads.nthreads()+1, iter)
           Threads.@threads for i ∈ 1:iter
               for j ∈ 1:Threads.nthreads()
                   res[j,i] = length(Base.Workqueues[j])
               end
               res[Threads.nthreads()+1,i] = length(Base.Workqueue)
           end
           res
       end
workqueuelen (generic function with 2 methods)

julia> wqls2 = workqueuelen(1_000_000);

julia> count(!iszero, wqls2)
575323

julia> sum(wqls2)
575323

julia> sum(wqls2) / length(wqls2)
0.019838724137931033

I.e., it is mostly 0 here. I'd have to look more at the implementation details to find out what to look for/what it's doing. I.e., whether a task pushed into the queue is popped off as soon as it is scheduled or once it is finished. My guess is the former.

If you had that available, would you still plan to use threadpools?

Yeah, if I had that available I could use it to avoid conflicts instead of threadpools and make everyone in this thread happy. ;)

jpsamaroo · 2021-09-20T15:21:04Z

I think that's somewhat expected, since tasks can also be placed in the PARTR multiq for DFS.

Yeah, if I had that available I could use it to avoid conflicts instead of threadpools and make everyone in this thread happy. ;)

Cool, then I'll put together a PR in the next week or so to expose multiq sizes.

tkf · 2021-09-20T20:33:02Z

Querying the scheduler state is not a great API. It's racy. Also, in theory, it can increase cache traffic and move otherwise thread-local states out of exclusive mode (although the work queue we currently use probably isn't designed to get beneficial effects from this). More importantly, the user is now responsible for making sure the result is identical no matter what the scheduler state is. Using such an API is as hard as using return_type. It's doable but it needs a big warning saying that the user needs to make sure the result is identical bit-by-bit. (Or a library function using it can declare that the behavior is non-deterministic. But it's really hard to program against such a library.)

Querying the global state is not the best mindset in concurrent programming, IMHO. Each component should act with mostly local knowledge with minimal synchronizations. The point is designing the system that self-organizes into efficient machinery when things are put together.

tknopp · 2021-09-22T09:40:21Z

will this PR also allow for changing the amount of threads in the first thread pool? This would be a great feature and also would prevent that people dynamically create new thread pools just because the size of the first one cannot be changed.

You mentioned Gtk.jl as a use case. Would be very interesting if this could help resolving this issue: JuliaGraphics/Gtk.jl#503

jpsamaroo · 2021-09-22T13:11:06Z

will this PR also allow for changing the amount of threads in the first thread pool? This would be a great feature and also would prevent that people dynamically create new thread pools just because the size of the first one cannot be changed.

No, that's intentionally a non-goal because it has the potential to break existing code that relies on Threads.nthreads() remaining constant (which this PR guarantees for code running within only a single threadpool). I would hope that users never directly use this feature, and only a few libraries use it very sparingly. The main threadpool is what users should be using, and they're still expected to size it appropriately ahead of time.

You mentioned Gtk.jl as a use case. Would be very interesting if this could help resolving this issue: JuliaGraphics/Gtk.jl#503

Yeah, this should be solvable by placing the Gtk event loop into its own threadpool, where it can execute for 100% of its thread's time slice if desired.

Adds support for Julia to be started with `--threads=auto|N[,M]` where `N` specifies the number of threads in the default threadpool and `M`, if provided, specifies the number of threads in the new interactive threadpool. Adds an optional first parameter to `Threads.@spawn`: `[:default|:interactive]`. If `:interactive` is specified, the task will be run by thread(s) in the interactive threadpool only (if there is one). Co-authored-by: K Pamnany <kpamnany@users.noreply.github.com>

tkf · 2022-04-26T03:45:13Z

base/partr.jl

+        nt = UInt32(Threads._nthreads_in_pool(tpid))
        if heap_c * nt <= heap_p
            return heap_p
        end

        heap_p += heap_c * nt


For the wait-free property of multiqueue, we need nt = nthreads() since all worker threads from all worker pools are possible enqueuers. We probably don't need to worry too much about this for now since most of enqueues happen inside each pool and a theoretical guarantee does not imply practical performance in general. But I think it's something to watch for when using it in practice especially when there are intensive inter-pool synchronizations. For example, a rather very conservative but wasteful approach is heap_p = max(heap_c * _nthreads_in_pool(tpid), nthreads() + 1).

@vtjnash Why is += used in heap_p += heap_c * nt instead of = by the way? We only need heap_p / nt > 1 here, right? The heap vector is either empy or fully initialized so I suppose it doesn't matter ATM though.

It's a good point. In particular, to retain interactivity, it should be a common pattern for an interactive task to enqueue work to the default threadpool. However, as you say, it may not be a problem in practice, and in any case we have many problems with the use of multi-queues in general; IMO the way forward is an evolution of your #43366.

tkf · 2022-04-27T10:09:54Z

FWIW I find it unclear that #42302 (comment) is addressed before merge

kpamnany · 2022-04-27T11:36:55Z

FWIW I find it unclear that #42302 (comment) is addressed before merge

Ah. @jpsamaroo and @JeffBezanson and I agreed that :interactive is the best of the options.

tknopp · 2022-04-30T09:19:52Z

Is there somewhere a high level description what this PR offers? I originally thought that this PR allows the user to create new thread pools during runtime (which also the title indicates) but browsing through the documentation it seems we now have two thread pools both of which need to be specified before initialization. We implement a low-latency high-throughput data acquisition platform and it would be highly desirable to spawn a new thread pool during runtime. In particular we do not want to share a pool between the UI (Gtk.jl) thread and the data acquisition threads (we have multiple).

A different question is what the implication of this PR for Gtk.jl are. Should the UI thread be run in the interactive pool?

kpamnany · 2022-05-01T01:05:24Z

There's a new manual entry.

This is intended to be used for low-latency, high interactivity tasks. It is also opt-in by the user.

tknopp · 2022-05-01T09:54:31Z

I read that manual entry but still do not see how to create new threadpools at runtime. The PR started with:

This PR implements threadpool support in the runtime, allowing extra threadpools of arbitrary sizes to be created at runtime.

and I just wonder how the new threadpools are created.

carstenbauer · 2022-05-01T11:04:16Z

There might be a confusion here: This PR adds threadpool support "to the runtime", the Julia runtime that is. AFAIU, it does not add features to dynamically add threads (or threadpools) "at runtime" (i.e. similar to addprocs for processes).

(FWIW, I'd also like the ability to dynamically add and remove Julia threads at runtime. But I understand that this might be tricky.)

tknopp · 2022-05-01T11:15:01Z

The PR was originally targeting:
"allowing extra threadpools of arbitrary sizes to be created at runtime"
Not sure what I am confusing here.

tkf · 2022-05-01T15:26:55Z

There have been a lot of iterations over the design and implementation. What's ended up in the documentation reflects what's implemented.

As @carstenbauer said,

"allowing extra threadpools of arbitrary sizes to be created at runtime"

is not implemented. The size of each worker pool is fixed at the startup time. That said, given various cleanup patches Jameson has been adding, I suspect the size of the pools becomes dynamic at some point.

Should the UI thread be run in the interactive pool?

Yes, GUI packages should use @spawn :interactive. This does not mean UI libraries have exclusive access to OS threads. If the Julia users (i.e., whoever starts the julia process) use appropriate options, these tasks are scheduled in a worker pool that is separated from the one used by normal @spawn. Furthermore, since the :interactive pool is shared across all packages, UI libraries still have to work cooperatively and programmed carefully to always delegate computation to the main pool.

tknopp · 2022-05-01T16:01:55Z

Thanks for the clarification. Did not want to criticize this PR but wanted to find out where this feature has been gone.

I suspect the size of the pools becomes dynamic at some point.

I really hope that this will be the case soon. Right now the workaround is to write __init__ functions looking like this

function __init__()
  if Threads.nthreads() < 4
    error("MyPackage requires four threads. Start Julia with `julia -t 4`.")
  end
end

which is not very user friendly. In my data acquisition application the isolation property of a dedicated thread pool is even more important. But anyway, the @spawn :interactive sounds like an important step for Gtk.jl. Looking forward to that.

davidanthoff · 2022-07-21T03:48:03Z

Hm, so there are only two pools now? I kind of had hoped that we could use this for the communication part in the REPL process in the VS Code extension, so that we can create our own thread pool for that which is completely separate from any thread pool that user code would be using, but that doesn't seem feasible with this design, right? If we run our communication tasks on the interactive thread pool, and then user code blocks that, then our communication is also blocked.

Just curious: was there a specific reason to now limit this to two pools, instead of just making it a flexible system that can support n pools?

jpsamaroo · 2022-08-03T12:19:03Z

@davidanthoff I'm going to try to recall the exact discussion around this, but suffice to say, we spent quite a lot of time discussing that particular feature possibility.

So one of the issues we were concerned about when discussing the possibility of user-defined (or even dynamically-added) threadpools is that, by allowing arbitrary threadpools to exist, we put a large burden on the ecosystem to ensure that threadpool usage retains performance composability.

What is performance composability, in this context? Performance composability is the ability to combine multiple pieces of code - each of which is performant on their own - and still retain similar magnitudes of performance in the combined code (or, if you're lucky, get better performance with the two combined).

And how can it be that extra threadpools hurt this? Say library A uses the default threadpool, and library B uses its own dedicated threadpool, both doing performance-sensitive operations. By default, the default threadpool uses all system threads, and so at full-tilt it can fully and near-optimally utilize the system's CPU resources. However, this other threadpool is also likely using multiple threads, probably the same number as we have in the default threadpool, and so when code is also executing on it at the same time as code executes on the default threadpool, suddenly the OS starts de-prioritizing some of our threads because it can't execute all of them in parallel. What's the result? Performance of both codes suffers. Because you were doing an unrelated computation in one threadpool, computations in another threadpool lose performance, and vice versa.

So, is this state of affairs permanent? Not at all! We discussed the possibility of exposing more built-in threadpools in the future, as necessity dictates, and also the possibility to revisit adding user-defined/dynamic threadpool support if we find that it's really necessary, and with the assumption that we have some plan to retain performance composability. For now, let's try out the interactive threadpool and see how things go! If you run into problems with meeting your interactivity deadlines, then we should have a discussion with the relevant maintainers to see if we can fix that.

davidanthoff · 2022-08-06T06:06:55Z

I understand all of that, but I think that analysis just misses the scenario we would like to enable in the VS Code extension.

The scenario we have is that we want to run a tiny bit of background code in a process that is primarily running user code, and we want our code to run as robustly not matter what kind of silly things the user is doing in the REPL process (or any of the other scenarios like notebooks, testing etc that we support). If the user during development makes a mistake and runs some code on the interactive thread pool that completely blocks things, then that is exactly the situation where we still want our code to be preemptively scheduled.

I think the worries about overscheduling really don't apply to our scenario. We would only want to add one extra thread and it would do very little work. Plus, the scenario we are in is interactive editing in VS Code. In any typical interactive editing session, there are probably hundreds (if not thousands) of other threads competing for time from the OS (in VS Code itself, web browsers, Spotify, who knows what else). It is hard to see how adding one more thread would make any difference. Even if it did, I think it would be the right trade-off in this scenario.

I had a brief chat with @vtjnash at Juliacon, and he mentioned that he was working on a PR that adds the ability to dynamically add threads, and from what he described that would probably allow us to do what we would like to do, so this might already be in the works.

jpsamaroo · 2022-08-10T19:20:22Z

I completely understand that; the concern is not that most libraries will do bad things, but that some libraries would do what I warned against, for the purpose of winning in benchmarks or because of some ideological disagreement with Julia's runtime system.

While that is harmless in and of itself, if users blindly enable a large additional threadpool (because they saw a blog post showing how to get max speed for some benchmark code, or even because the package docs mention it), then we'll end up getting issues filed when performance drops for complicated codes which incidentally use that library.

I think what's missing here is that we don't have a form of preemptive multitasking which would allow an interactive or semi-real-time task to forcibly steal execution time from some other actively-executing task.

we want our code to run as robustly not matter what kind of silly things the user is doing in the REPL process

Can you give a brief explanation of what that code will be doing? It might help inform the direction of this discussion if I can make more concrete recommendations.

In any typical interactive editing session, there are probably hundreds (if not thousands) of other threads competing for time from the OS (in VS Code itself, web browsers, Spotify, who knows what else). It is hard to see how adding one more thread would make any difference. Even if it did, I think it would be the right trade-off in this scenario.

These threads aren't all necessarily utilizing 100% of each CPU on the system; if they were, you'd probably get very poor performance for every thread in the system. Instead, many are doing I/O or infrequently doing small bits of work, which is exactly the idea for the kinds of tasks that should use the interactive threadpool.

I had a brief chat with @vtjnash at Juliacon, and he mentioned that he was working on a PR that adds the ability to dynamically add threads, and from what he described that would probably allow us to do what we would like to do, so this might already be in the works.

The work Jameson mentioned is probably #45447 , which allows non-Julia threads to call into Julia code, but not necessarily to be able to create more standard Julia threads from Julia code. However, that might benefit your use case by allowing you to execute some non-yielding Julia code from a manually-allocated thread (which will require some ccalls to allocate and configure a thread from the OS).

jpsamaroo mentioned this pull request Sep 18, 2021

Severe thread starvation issues #41586

Closed

jpsamaroo force-pushed the jps/moar-threads branch from b060dba to 943d4fb Compare September 18, 2021 19:37

tkf added the multithreading Base.Threads and related functionality label Sep 18, 2021

tkf reviewed Sep 18, 2021

View reviewed changes

base/threadingconstructs.jl Outdated Show resolved Hide resolved

tkf reviewed Sep 18, 2021

View reviewed changes

tkf mentioned this pull request Sep 19, 2021

Remove buffer pool for tiled GEMM #42309

Merged

kpamnany force-pushed the jps/moar-threads branch from ccfa09c to 411077c Compare April 22, 2022 18:47

kpamnany changed the title ~~[RFC] Add threadpool support to runtime~~ Add threadpool support to runtime Apr 25, 2022

kpamnany force-pushed the jps/moar-threads branch 2 times, most recently from 11637f1 to f58cf21 Compare April 25, 2022 16:11

kpamnany force-pushed the jps/moar-threads branch from f58cf21 to c7ea709 Compare April 25, 2022 18:48

tkf reviewed Apr 26, 2022

View reviewed changes

Merge branch 'master' into jps/moar-threads

ab78165

kpamnany merged commit f410c62 into JuliaLang:master Apr 26, 2022

kpamnany deleted the jps/moar-threads branch April 26, 2022 20:47

ViralBShah added the needs news A NEWS entry is required for this change label Apr 26, 2022

benlorenz mentioned this pull request Apr 27, 2022

GC crash while allocating GAP object oscar-system/Oscar.jl#1283

Closed

kpamnany mentioned this pull request Apr 27, 2022

Add NEWS entry for threadpools #45107

Merged

JeffBezanson removed the needs news A NEWS entry is required for this change label Nov 10, 2022

fonsp mentioned this pull request Nov 29, 2022

serve() fails when run from an :interactive thread (Julia >=1.9) JuliaWeb/HTTP.jl#970

Closed

nickrobinson251 mentioned this pull request Mar 9, 2023

Docs on JULIA_NUM_THREADS env var should mention threadpools #48960

Closed

NHDaly mentioned this pull request Mar 14, 2023

Feature request: prevent spawning to thread 1 #34267

Closed

Add threadpool support to runtime #42302

Add threadpool support to runtime #42302

Conversation

jpsamaroo commented Sep 18, 2021 • edited Loading

tkf commented Sep 18, 2021

jpsamaroo commented Sep 18, 2021

jpsamaroo commented Sep 18, 2021 • edited Loading

tkf commented Sep 18, 2021 • edited Loading

tkf Sep 18, 2021 • edited Loading

Choose a reason for hiding this comment

tkf Sep 18, 2021

Choose a reason for hiding this comment

jpsamaroo Sep 19, 2021

Choose a reason for hiding this comment

tkf Sep 19, 2021 • edited Loading

Choose a reason for hiding this comment

jpsamaroo Jan 13, 2022

Choose a reason for hiding this comment

tkf left a comment • edited Loading

Choose a reason for hiding this comment

tkf commented Sep 18, 2021

DilumAluthge commented Sep 19, 2021

chriselrod commented Sep 19, 2021

DilumAluthge commented Sep 19, 2021

jpsamaroo commented Sep 19, 2021

tkf commented Sep 19, 2021

chriselrod commented Sep 20, 2021 • edited Loading

tkf commented Sep 20, 2021

vchuravy commented Sep 20, 2021

KristofferC commented Sep 20, 2021 • edited Loading

chriselrod commented Sep 20, 2021 • edited Loading

jpsamaroo commented Sep 20, 2021

chriselrod commented Sep 20, 2021 • edited Loading

jpsamaroo commented Sep 20, 2021

chriselrod commented Sep 20, 2021

jpsamaroo commented Sep 20, 2021

tkf commented Sep 20, 2021 • edited Loading

tknopp commented Sep 22, 2021

jpsamaroo commented Sep 22, 2021

tkf Apr 26, 2022

Choose a reason for hiding this comment

kpamnany Apr 26, 2022

Choose a reason for hiding this comment

tkf commented Apr 27, 2022

kpamnany commented Apr 27, 2022 • edited Loading

tknopp commented Apr 30, 2022

kpamnany commented May 1, 2022

tknopp commented May 1, 2022

carstenbauer commented May 1, 2022 • edited Loading

tknopp commented May 1, 2022

tkf commented May 1, 2022

tknopp commented May 1, 2022

davidanthoff commented Jul 21, 2022 • edited Loading

jpsamaroo commented Aug 3, 2022

davidanthoff commented Aug 6, 2022

jpsamaroo commented Aug 10, 2022

jpsamaroo commented Sep 18, 2021 •

edited

Loading

jpsamaroo commented Sep 18, 2021 •

edited

Loading

tkf commented Sep 18, 2021 •

edited

Loading

tkf Sep 18, 2021 •

edited

Loading

tkf Sep 19, 2021 •

edited

Loading

tkf left a comment •

edited

Loading

chriselrod commented Sep 20, 2021 •

edited

Loading

KristofferC commented Sep 20, 2021 •

edited

Loading

chriselrod commented Sep 20, 2021 •

edited

Loading

chriselrod commented Sep 20, 2021 •

edited

Loading

tkf commented Sep 20, 2021 •

edited

Loading

kpamnany commented Apr 27, 2022 •

edited

Loading

carstenbauer commented May 1, 2022 •

edited

Loading

davidanthoff commented Jul 21, 2022 •

edited

Loading