The ThreadKit package contains the following threading utilities:
- Thread pool: Simple and usable thread pool.
- Thread tracer: Lightweight inline thread profiler.
- Skinny mutex: Low-memory-footprint mutexes for POSIX Threads.
- Tasklet: Very lightweight thread without its own stack.
Currently, this thread pool implementation:
- Works with pthreads only, but the API is intentionally opaque to allow other implementations.
- Starts all threads on creation of the thread pool.
- Reserves one task for signaling the queue is full.
- Stops and joins all worker threads on destroy.
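The point about reserving one task can be illustrated with a ring buffer that keeps one slot unused, so that "full" and "empty" are distinguishable without a separate counter. This is only a sketch of that general technique, not ThreadKit's actual queue: the names task_t, task_queue_t, QUEUE_SLOTS and the queue_* functions are hypothetical, and a real pool would guard every call with a mutex.

```c
#include <assert.h>
#include <stddef.h>

#define QUEUE_SLOTS 4   /* hypothetical size; at most QUEUE_SLOTS - 1 tasks are queued */

typedef struct {
    void (*fn)(void *);  /* work to run */
    void *arg;           /* argument passed to fn */
} task_t;

typedef struct {
    task_t slots[QUEUE_SLOTS];
    size_t head;         /* index of the next task to pop */
    size_t tail;         /* index of the next free slot to push into */
} task_queue_t;

/* Example task used for demonstration; does nothing. */
static void noop_task(void *arg) { (void)arg; }

static void queue_init(task_queue_t *q) { q->head = q->tail = 0; }

/* Empty: head has caught up with tail. */
static int queue_is_empty(const task_queue_t *q) { return q->head == q->tail; }

/* Full: advancing tail would collide with head, so one slot always stays unused. */
static int queue_is_full(const task_queue_t *q)
{
    return (q->tail + 1) % QUEUE_SLOTS == q->head;
}

/* Returns 0 on success, -1 if the queue is full. Not thread-safe by itself. */
static int queue_push(task_queue_t *q, void (*fn)(void *), void *arg)
{
    assert(fn != NULL);
    if (queue_is_full(q))
        return -1;
    q->slots[q->tail].fn = fn;
    q->slots[q->tail].arg = arg;
    q->tail = (q->tail + 1) % QUEUE_SLOTS;
    return 0;
}

/* Returns 0 and fills *out on success, -1 if the queue is empty. */
static int queue_pop(task_queue_t *q, task_t *out)
{
    if (queue_is_empty(q))
        return -1;
    *out = q->slots[q->head];
    q->head = (q->head + 1) % QUEUE_SLOTS;
    return 0;
}
```

With QUEUE_SLOTS set to 4, three pushes succeed and the fourth is rejected until a worker pops a task.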
Possible enhancements would allow some additional options:
- Lazy creation of threads
- Reduce number of threads automatically
- Unlimited queue size
- Kill worker threads on destroy
- Reduce locking contention
ThreadTracer is a lightweight inline profiler that measures wall-time, cpu-time and preemptive context switches for threads.
It is special in the following ways:
- Fully supports multi threaded applications.
- Will never cause your thread to go to sleep because of profiling.
- Will not miss events.
- Will detect if threads were context-switched by scheduler, preemptively or voluntarily.
- Computes duty-cycle for each scope: not just how long it ran, but also how much of that time it was scheduled on a core.
- Small, lightweight system, written in C. Just one header and one small implementation file.
- Zero dependencies.
- Doesn't show a live profile, but creates a report after the run, viewable with Google Chrome.
- Currently does not support asynchronous events that start on one thread, and finish on another.
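To make the duty-cycle idea concrete, here is a rough sketch of how a scope's wall time, per-thread CPU time and context-switch counts can be sampled on Linux. It is an illustration under assumptions, not ThreadTracer's own code: scope_sample_t, scope_sample, elapsed_sec and duty_cycle are hypothetical names, and RUSAGE_THREAD is a Linux extension, so the sketch falls back to process-wide counters where it is missing.

```c
#define _GNU_SOURCE              /* exposes RUSAGE_THREAD on Linux */
#include <sys/resource.h>
#include <time.h>

typedef struct {
    struct timespec wall;        /* monotonic wall-clock timestamp */
    struct timespec cpu;         /* CPU time consumed by this thread */
    long preempted;              /* involuntary context switches so far */
    long voluntary;              /* voluntary context switches so far */
} scope_sample_t;

/* Take one sample; call it at the start and at the end of a scope. */
static void scope_sample(scope_sample_t *s)
{
    struct rusage ru;
    clock_gettime(CLOCK_MONOTONIC, &s->wall);
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &s->cpu);
#ifdef RUSAGE_THREAD
    getrusage(RUSAGE_THREAD, &ru);
#else
    getrusage(RUSAGE_SELF, &ru); /* fallback: counts the whole process */
#endif
    s->preempted = ru.ru_nivcsw;
    s->voluntary = ru.ru_nvcsw;
}

/* Seconds elapsed from a to b. */
static double elapsed_sec(const struct timespec *a, const struct timespec *b)
{
    return (double)(b->tv_sec - a->tv_sec)
         + (double)(b->tv_nsec - a->tv_nsec) / 1e9;
}

/* Fraction of the scope's wall time that the thread spent on a core. */
static double duty_cycle(const scope_sample_t *begin, const scope_sample_t *end)
{
    double wall = elapsed_sec(&begin->wall, &end->wall);
    double cpu = elapsed_sec(&begin->cpu, &end->cpu);
    return wall > 0.0 ? cpu / wall : 0.0;
}
```

Sampling before and after a scope and subtracting the preempted counters gives the number of preemptions inside that scope; a duty cycle well below 1.0 means the thread was waiting or descheduled for part of the time.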
#include "threadtracer.h"
// Each thread that will be generating profiling events needs to be made known to the system.
TT_ENTRY();
// C Programs need to wrap sections of code with a begin and end macro.
TT_BEGIN("simulation");
simulate( dt );
TT_END("simulation");
// When you are done profiling, typically at program end, or earlier, you can generate the profile report.
TT_REPORT();
Start the Google Chrome browser, type chrome://tracing in the URL bar, and then load the generated threadtracer*.json file.
Note that for the highlighted task, the detail view shows that the thread got interrupted once preemptively, which causes it to run on a CPU core for only 81% of the time that the task took to complete.
The shading of the time slices shows the duty cycle: how much of the time was spent running on a core.
To avoid recording samples right after launch, you can skip the first seconds of recording with an environment variable. To skip the first five seconds, do:
$ THREADTRACERSKIP=5 ./foo
ThreadTracer: clock resolution: 1 nsec.
ThreadTracer: skipping the first 5 seconds before recording.
ThreadTracer: Wrote 51780 events (6 discarded) to threadtracer.json
Thanks to chrome://tracing for its excellent in-browser visualization.
The main kind of lock provided by the pthreads API is the mutex
(pthread_mutex_t). These have a lot of features (enabled through the
attributes set in pthread_mutexattr_t), integrate with condition variables,
and handle contention gracefully.
But a drawback is their size. On Linux with glibc, a pthread_mutex_t occupies
40 bytes on x86-64. If the mutex is protecting a small data structure, this
can lead to unwelcome overheads in memory usage, and reduce the effectiveness
of caches.
Some pthreads implementations also have spinlocks (pthread_spinlock_t).
These are smaller (4 bytes on Linux). But they don't handle contention
gracefully, so they are best used for critical sections containing small
amounts of code that can be verified to have a short bounded running time.
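The sizes of these types depend on the platform and C library, so it is worth measuring them on your own system. The following sketch reports them next to the size of a pointer; the helper names (mutex_size, spinlock_size, word_size, print_lock_sizes) are made up for this example, and spinlock_size returns 0 where the implementation does not advertise spinlocks.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stddef.h>
#include <stdio.h>
#include <unistd.h>

/* Size of a full pthreads mutex on this platform. */
static size_t mutex_size(void) { return sizeof(pthread_mutex_t); }

/* Size of a pthreads spinlock, or 0 if spinlocks are not provided. */
static size_t spinlock_size(void)
{
#if defined(_POSIX_SPIN_LOCKS) && _POSIX_SPIN_LOCKS > 0
    return sizeof(pthread_spinlock_t);
#else
    return 0;
#endif
}

/* Size of one pointer-sized word, for comparison. */
static size_t word_size(void) { return sizeof(void *); }

/* Print all three sizes. */
static void print_lock_sizes(void)
{
    printf("pthread_mutex_t:    %zu bytes\n", mutex_size());
    printf("pthread_spinlock_t: %zu bytes\n", spinlock_size());
    printf("pointer-sized word: %zu bytes\n", word_size());
}
```

Calling print_lock_sizes() from main shows how many pointer-sized words a single pthread_mutex_t costs on the machine at hand.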
Skinny mutexes address this by occupying only one pointer-sized word. Like pthreads mutexes, they integrate with condition variables and handle contention gracefully, so code using pthreads mutexes can be easily converted to use skinny mutexes instead.
Skinny mutexes use atomic operations when possible (e.g. when locking or
unlocking an uncontended skinny mutex), and fall back to the pthreads
primitives when necessary (e.g. when a lock is contended, causing a thread to
block). So you will still need to compile with -pthread. Performance should
generally be similar to pthreads mutexes, and it might even be better in some
cases.
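The uncontended fast path can be pictured as a single compare-and-swap on one word. The toy lock below is a deliberately simplified sketch using C11 atomics and is not the skinny mutex implementation: toy_lock_t, toy_trylock and toy_unlock are hypothetical names, and the contended slow path (where a real skinny mutex would fall back to pthreads primitives and block) is left out entirely.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdint.h>

/* One pointer-sized word: 0 means unlocked, 1 means locked. */
typedef struct {
    _Atomic uintptr_t word;
} toy_lock_t;

#define TOY_LOCK_INITIALIZER { 0 }

/* Fast path only: succeeds if nobody holds the lock, else reports EBUSY. */
static int toy_trylock(toy_lock_t *l)
{
    uintptr_t expected = 0;
    if (atomic_compare_exchange_strong_explicit(&l->word, &expected,
                                                (uintptr_t)1,
                                                memory_order_acquire,
                                                memory_order_relaxed))
        return 0;
    return EBUSY;
}

/* Release the lock; the release ordering publishes writes made while holding it. */
static int toy_unlock(toy_lock_t *l)
{
    atomic_store_explicit(&l->word, (uintptr_t)0, memory_order_release);
    return 0;
}
```

When the lock is free, acquiring and releasing it costs one atomic operation each and no system call, which is why an uncontended lock of this shape can be cheap.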
Pthread | Skinny mutex |
---|---|
pthread_mutex_t | skinny_mutex_t |
pthread_mutex_init | skinny_mutex_init |
pthread_mutex_destroy | skinny_mutex_destroy |
pthread_mutex_lock | skinny_mutex_lock |
pthread_mutex_unlock | skinny_mutex_unlock |
pthread_mutex_trylock | skinny_mutex_trylock |
pthread_cond_wait | skinny_mutex_cond_wait |
pthread_cond_timedwait | skinny_mutex_cond_timedwait |
PTHREAD_MUTEX_INITIALIZER | SKINNY_MUTEX_INITIALIZER |
Note that skinny_mutex_init does not take an attributes argument (see below
for more details). Other than that, all the arguments of the functions
mentioned correspond to the pthreads ones, and their specifications and
return values are intended to correspond exactly.
In particular, skinny_mutex_lock is not a thread cancellation point, and
skinny_mutex_cond_wait is.
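Because the functions correspond one to one, a conversion can be as small as swapping names. The sketch below hides the choice behind a few macros so the same code builds either way. The USE_SKINNY_MUTEX switch, the lock_t and counter_t types and the lock_acquire/lock_release macros are hypothetical, and the header name skinny_mutex.h is an assumption; only the skinny_mutex_* and SKINNY_MUTEX_INITIALIZER names come from the mapping above.

```c
#include <pthread.h>

#ifdef USE_SKINNY_MUTEX                  /* hypothetical build switch */
#include "skinny_mutex.h"                /* assumed header name */
typedef skinny_mutex_t lock_t;
#define LOCK_INITIALIZER  SKINNY_MUTEX_INITIALIZER
#define lock_acquire(m)   skinny_mutex_lock(m)
#define lock_release(m)   skinny_mutex_unlock(m)
#else                                    /* default: plain pthreads mutex */
typedef pthread_mutex_t lock_t;
#define LOCK_INITIALIZER  PTHREAD_MUTEX_INITIALIZER
#define lock_acquire(m)   pthread_mutex_lock(m)
#define lock_release(m)   pthread_mutex_unlock(m)
#endif

/* A small structure where the lock size is a large share of the total. */
typedef struct {
    lock_t lock;
    long value;
} counter_t;

#define COUNTER_INITIALIZER { LOCK_INITIALIZER, 0 }

/* Add delta under the lock and return the new value. */
static long counter_add(counter_t *c, long delta)
{
    long result;
    lock_acquire(&c->lock);
    c->value += delta;
    result = c->value;
    lock_release(&c->lock);
    return result;
}
```

counter_add can be called from any number of threads. Building with -DUSE_SKINNY_MUTEX would shrink counter_t without touching the calling code, assuming the header is on the include path.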
Unlike pthreads mutexes, skinny mutexes do not currently support any mutex
attributes. Their behavior corresponds to the default pthread mutex
attributes (i.e. with NULL passed as the second argument to
pthread_mutex_init).
It is possible to add support for error checking corresponding to the
PTHREAD_MUTEX_ERRORCHECK type attribute (from pthread_mutexattr_settype).
This will probably be a compile-time option.
It seems feasible to add support for the protocol attribute
(PTHREAD_PRIO_INHERIT and PTHREAD_PRIO_PROTECT from
pthread_mutexattr_setprotocol). There might be room for improvements.
The PTHREAD_MUTEX_RECURSIVE type attribute will not be supported, as
it would require skinny_mutex_t to grow, and you can rewrite code to
avoid the need for recursive mutexes.
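One common way to remove the need for a recursive mutex is to split each operation into a public function that takes the lock and an internal helper that assumes the lock is already held. The sketch below shows the pattern with a plain pthreads mutex; account_t and the deposit functions are invented for illustration.

```c
#include <pthread.h>

typedef struct {
    pthread_mutex_t lock;
    long balance;
} account_t;

/* Internal helper: the caller must already hold a->lock. */
static void deposit_locked(account_t *a, long amount)
{
    a->balance += amount;
}

/* Public entry point: takes the lock exactly once. */
static void deposit(account_t *a, long amount)
{
    pthread_mutex_lock(&a->lock);
    deposit_locked(a, amount);
    pthread_mutex_unlock(&a->lock);
}

/*
 * A compound operation. It calls the _locked helper twice instead of
 * calling deposit(), so it never tries to re-acquire a lock it holds.
 */
static void deposit_with_bonus(account_t *a, long amount, long bonus)
{
    pthread_mutex_lock(&a->lock);
    deposit_locked(a, amount);
    deposit_locked(a, bonus);
    pthread_mutex_unlock(&a->lock);
}
```

Had deposit_with_bonus called deposit() while holding the lock, a default (non-recursive) mutex would deadlock or fail; routing through deposit_locked avoids that.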
Support for the process-shared and priority ceiling attributes
(pthread_mutexattr_setpshared and pthread_mutexattr_setprioceiling) is also
unlikely, as they seem to be of marginal usefulness and/or hard to implement.
A tasklet is a sequential context of execution. Like a thread, a tasklet can wait for events (such as data arriving on a socket). Unlike a thread, a tasklet does not have its own stack, so tasklet code has to follow certain idioms. But those idioms are less cumbersome than trying to write callback-based code in C, particularly in a multithreaded context.
Tasklets are very lightweight; many millions of tasklets could fit in the memory of a modern machine. A scalable service can schedule runnable tasklets onto a much smaller number of threads.
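To give a feel for what code without its own stack looks like, here is a generic stackless state machine in C. It does not use the ThreadKit tasklet API, which is not shown in this document: adder_t and adder_step are hypothetical. The point is only that everything that must survive a wait lives in a small struct rather than on a stack.

```c
/* All state that must survive between events lives here, not on a stack. */
typedef struct {
    int state;   /* which step runs when the next event arrives */
    int sum;     /* running result */
} adder_t;

#define ADDER_INITIALIZER { 0, 0 }

/*
 * Called once per event, where an event is one incoming value.
 * Returns 0 while the task is still waiting for more input and
 * 1 once it has consumed three values and finished.
 */
static int adder_step(adder_t *t, int value)
{
    switch (t->state) {
    case 0:                      /* first value: start the sum */
        t->sum = value;
        t->state = 1;
        return 0;
    case 1:                      /* second value */
        t->sum += value;
        t->state = 2;
        return 0;
    case 2:                      /* third value: finish */
        t->sum += value;
        t->state = 3;
        return 1;
    default:                     /* already finished; ignore further events */
        return 1;
    }
}
```

Each adder_t is only a few bytes, so millions of such contexts fit in memory, and any worker thread can run the next step of whichever one has an event pending.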