Design: thread pools #6
The Java SDK docs have surprisingly good notes on these trade-offs: java.util.concurrent.ThreadPoolExecutor
Why the CLR 2.0 SP1's threadpool default max thread count was increased to 250/CPU (Not sure what the difference between the latter two papers is)
@arigo points out something disturbing about Python threads (on both CPython and PyPy):

I think the practical impact of this problem is very low. In particular, 64-bit systems have the address space to handle an almost arbitrary number of threads, and even on 32-bit systems, if our default thread limit is on the order of "dozens", then it would take a lot of threads simultaneously stalled in the last bit of exiting to cause a problem. Unless the heap was taking up most of the address space, I guess. It makes me a bit nervous that it's the sort of thing that someone will find a way to hit under high load, though.

One response to this would be to use a thread pool. We would want an unbounded thread pool, I think, i.e. one where if a job comes in and all threads are busy, we spawn a new thread (and if a thread is idle for too long, it goes away). (So that's an interesting control problem to avoid too much churn.)

So long as worker threads marked themselves as "soon to be available" before signaling back to the trio thread, this would avoid the race condition in the steady state case by making sure that we never have more threads running than were allowed at some point by the trio-side limiting within the last N seconds. In principle, though, it would still be possible to hit the race condition by having exactly the wrong sort of cyclic workload, where we spin up N threads, then they all sit idle long enough to start running their cleanup logic, and while they're still doing that we suddenly have to spin up another N threads to replace them, so we end up with 2*N threads running at once. To solve this really-properly-for-sure, we would need to stop using Python's threading module to spawn and join threads. Unfortunately, it's not quite as simple as using cffi to wrap

Given our unbounded thread pool semantics, it makes sense to use a single thread pool for the whole process (shared across subinterpreters and threads), which might simplify things, or make them more complicated. It's not mandatory.
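To isolate the ordering constraint described above, here is a tiny sketch (not trio code; get_job, mark_available, and deliver are placeholder callables for whatever the pool would provide): the point is simply that a worker rejoins the "available" set before it wakes the submitting task.

import outcome

def worker_loop(get_job, mark_available, deliver):
    # get_job() blocks until this worker is handed a job; mark_available()
    # returns the worker to the pool's "available" set; deliver() signals the
    # submitting (trio-side) task. All three are placeholders.
    while True:
        fn = get_job()
        result = outcome.capture(fn)
        mark_available()   # step 1: become available again *before*...
        deliver(result)    # step 2: ...telling the submitter we're done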
- New synchronization primitive: CapacityLimiter. Like a Semaphore but more specialized. See python-triogh-182 for rationale.
- Add limiter= argument to run_in_worker_thread, that allows one to correctly (modulo python-trio#6 (comment)) control the number of active threads.
- Added new function current_default_worker_thread_limiter(), which creates or returns a run-local CapacityLimiter, and made run_in_worker_thread use it by default when no other limiter= is given.

Closes: python-triogh-10, python-triogh-57, python-triogh-156
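For concreteness, a rough usage sketch of that limiter= pattern, written with the names trio uses today (trio.CapacityLimiter, trio.to_thread.run_sync, as in the benchmark later in this thread) rather than the run_in_worker_thread spelling in the changelog entry above:

import trio

async def main():
    # Allow at most 4 jobs to occupy worker threads at any one time.
    limiter = trio.CapacityLimiter(4)

    async def job():
        # Each call waits for a limiter token before a worker thread runs it.
        await trio.to_thread.run_sync(sum, range(10**6), limiter=limiter)

    async with trio.open_nursery() as nursery:
        for _ in range(16):
            nursery.start_soon(job)

trio.run(main)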
So with #181 we've pretty much settled on our general architecture for handling worker threads: a lower-level unbounded "thread cache" (similar to the JDK's "cached thread pool"), plus an extensible policy layer on top that runs in the trio thread.

So the remaining issue is: currently our "thread cache"'s replacement policy is "always", i.e., we don't actually have a cache, we just start a new thread every time. Maybe it would be worthwhile to re-use threads. This is a non-trivial increase in complexity, and it's primarily an optimization, so maybe it should wait until we have some real programs whose behavior we can measure.

If/when we do this, we'll need to figure out the API for interacting with the cache. At the least, we'll need a

We'll also want to re-use this "thread cache" for other miscellaneous threads that can't quite use the standard
For some reason a plausible algorithm for this popped into my head today:

import threading
import outcome
import queue

try:
    # SimpleQueue is faster, but only available on python 3.7+
    from queue import SimpleQueue
except ImportError:
    from queue import Queue as SimpleQueue

# How long a thread will idle waiting for new work before it exits. I don't
# think it should matter too much, though it should be substantially longer
# than the cost of creating a thread, which is on the order of 10-100 µs
IDLE_TIMEOUT = 10  # seconds

class ThreadCache:
    def __init__(self):
        self._idle_workers = 0
        self._total_workers = 0
        self._lock = threading.Lock()
        self._q = SimpleQueue()

    def _worker(self):
        while True:
            try:
                job = self._q.get(timeout=IDLE_TIMEOUT)
            except queue.Empty:
                with self._lock:
                    if self._idle_workers == 0:
                        # We were *just* assigned some work, so loop back
                        # around to get it
                        continue
                    else:
                        self._idle_workers -= 1
                        self._total_workers -= 1
                        return
            fn, deliver = job
            result = outcome.capture(fn)
            with self._lock:
                self._idle_workers += 1
            deliver(result)

    def submit(self, fn, deliver):
        with self._lock:
            if self._idle_workers == 0:
                # Spawn a new worker.
                threading.Thread(target=self._worker, daemon=True).start()
                self._total_workers += 1
            else:
                self._idle_workers -= 1
            self._q.put((fn, deliver))

I think that's correct. It's deceptively simple.

This is designed to work as a process-global singleton, so if you have multiple

Tracking

This kind of interaction is also related to why we have the
In the design above, the thread marks itself as idle before reporting back to the main thread, so if

There is still a tiny race condition where we can briefly end up with more than 10 threads: in the moment between when an idle thread decides to give up and quit, and when it actually does so, an 11th thread could be spawned. That situation only persists for a tiny fraction of a second though before correcting itself, while the

Hmm, in fact if you're unlucky, it could let you exceed the threshold forever... our simple thread cleanup scheme above isn't actually guaranteed to converge on the right number of threads. Imagine you have 1 job submitted per second, and it completes ~instantaneously. So you really only need 1 thread to handle all the jobs. But let's say, for whatever reason, you have 10 threads, and they're all waiting on the same queue for jobs to be assigned. (Maybe you briefly needed 10 threads a while ago, but don't anymore.) If you're unlucky, the jobs might be assigned to the threads round-robin style, so thread 1 handles the first job, thread 2 handles the second job, etc. This means that every thread ends up handling 1 request every 10 seconds. So if your idle timeout is 10 seconds... no thread is ever idle that long, and they all stay alive, even though 9 of them are superfluous.

Some ideas for solving this:
(You can also get much more fancy with controller design to adjust thread cache size. For example, see Optimizing Concurrency Levels in the .NET ThreadPool: A Case Study of Controller Design and Implementation. One obvious addition would be to add some hysteresis, to smooth out the thread pool size instead of letting all the threads exit at once.)
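To make the hysteresis idea concrete, here is a tiny hypothetical sketch (ExitThrottle and EXIT_INTERVAL are invented names, not part of any proposal above): a worker whose idle timeout fires only actually exits if no other worker has exited within the last EXIT_INTERVAL seconds, so the pool shrinks one thread at a time instead of collapsing all at once.

import threading
import time

EXIT_INTERVAL = 5.0  # assumed: minimum spacing between worker exits, in seconds

class ExitThrottle:
    def __init__(self):
        self._lock = threading.Lock()
        self._last_exit = float("-inf")

    def may_exit(self):
        # Called by a worker whose idle timeout fired. Returns True if this
        # worker may exit now; False means "stay alive and keep waiting".
        with self._lock:
            now = time.monotonic()
            if now - self._last_exit >= EXIT_INTERVAL:
                self._last_exit = now
                return True
            return False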
Here's another version that's about 2x faster than the one I put above, and assigns work to threads in LIFO style, so idle timeouts will work properly and it has better cache behavior:

import threading
import sys

# TODO: also use dict on pypy
# Note: we need an ordered dict that's thread-safe (assignment, del, and
# popitem should all be atomic wrt each other). Fortunately, dict is always
# thread-safe and on py35+, OrderedDict is also thread-safe (protected by the
# GIL).
if sys.version_info >= (3, 7):
    odict = dict
else:
    from collections import OrderedDict as odict

import outcome

# How long a thread will idle waiting for new work before it exits. I don't
# think it should matter too much, though it should be substantially longer
# than the cost of creating a thread, which is on the order of 10-100 µs
IDLE_TIMEOUT = 10  # seconds

class WorkerThread:
    def __init__(self, thread_cache):
        self._job = None
        self._thread_cache = thread_cache
        # Weird convention for this lock: "unlocked" means we've been assigned a job
        # Initially we have no job, so it starts out in locked state.
        self._worker_lock = threading.Lock()
        self._worker_lock.acquire()
        thread = threading.Thread(target=self._work, daemon=True)
        thread.start()

    def _work(self):
        while True:
            if self._worker_lock.acquire(timeout=IDLE_TIMEOUT):
                # We got a job
                fn, deliver = self._job
                self._job = None
                result = outcome.capture(fn)
                self._thread_cache._idle_workers[self] = None
                deliver(result)
            else:
                # Timeout acquiring lock, so we can probably exit
                try:
                    del self._thread_cache._idle_workers[self]
                except KeyError:
                    # We're being assigned a job, so we can't exit yet
                    continue
                else:
                    # We successfully removed ourselves from the idle
                    # worker queue, so we can exit
                    return

class ThreadCache:
    def __init__(self):
        self._idle_workers = odict()
        self._cache_lock = threading.Lock()

    def submit(self, fn, deliver):
        try:
            worker, _ = self._idle_workers.popitem()
        except KeyError:
            worker = WorkerThread(self)
        worker._job = (fn, deliver)
        worker._worker_lock.release()

On my Linux laptop with CPython 3.7.3, I get:

In [58]: a, b = socket.socketpair()
In [59]: %timeit tc.submit(lambda: None, lambda _: a.send(b"x")); b.recv(1)
8.48 µs ± 312 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [60]: %timeit outcome.capture(lambda: None); (lambda _: a.send(b"x"))(None); b.recv(1)
2.63 µs ± 67.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [63]: %timeit threading.Thread(target=lambda: a.send(b"x")).start(); b.recv(1)
79.5 µs ± 11 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

So pushing a job into a warm thread pool adds about 6 µs of overhead versus running it in the main thread, and is ~10x faster than spawning a thread (like we do now). I also ran
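For reference, a plain-script version of the same measurement, for anyone not using IPython's %timeit (it assumes the ThreadCache/WorkerThread code above is defined in the same module; absolute numbers will of course differ by machine):

import socket
import time

tc = ThreadCache()
a, b = socket.socketpair()

N = 100_000
start = time.perf_counter()
for _ in range(N):
    # deliver() sends one byte from the worker thread; recv() blocks the
    # main thread until the job has been handed back.
    tc.submit(lambda: None, lambda _: a.send(b"x"))
    b.recv(1)
elapsed = time.perf_counter() - start
print("{:.2f} µs per job".format(elapsed / N * 1e6))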
On my Linux laptop, this makes 'await trio.to_thread.run_sync(lambda: None)' about twice as fast, from ~150 µs to ~75 µs.

Closes: python-triogh-6

Test program:

import trio
import time

COUNT = 10000

async def main():
    while True:
        start = time.monotonic()
        for _ in range(COUNT):
            await trio.to_thread.run_sync(lambda: None)
        end = time.monotonic()
        print("{:.2f} µs/job".format((end - start) / COUNT * 1e6))

trio.run(main)
Right now, run_in_worker_thread just always spawns a new thread for the operation, and then kills it after. This might sound ridiculous, but it's not so obviously wrong as it looks! There's a large comment in trio._threads talking about some of the issues.

Questions:

- Should there be a way for run_in_worker_thread to say that it shouldn't block waiting for a thread because it might unblock a thread?

Prior art: https://twistedmatrix.com/trac/ticket/5298
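For contrast with the pooled designs discussed above, a minimal illustration of the spawn-a-thread-per-call strategy (a sketch of the idea only, not trio's actual run_in_worker_thread internals):

import threading
import outcome

def run_in_fresh_thread(fn, deliver):
    # Spawn a brand-new thread for this one job; the thread exits as soon as
    # it has handed the result to deliver().
    def worker():
        deliver(outcome.capture(fn))
    threading.Thread(target=worker, daemon=True).start()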