As with any software, performance is a sensitive topic that often sparks heated discussions. It is objectively difficult to produce scientifically accurate figures, as they depend on many factors, including at least the hardware, operating system, Common Lisp implementation, optimization flags and usage pattern.
What follows is a list of micro-benchmarks, suitable for getting an initial idea of STMX performance for short, non-conflicting transactions.
Setup and optimization flags:
- Before starting the REPL, it is recommended to remove any cached FASL files by deleting the folder ~/.cache/common-lisp/
- Start the REPL and execute what follows:

      (declaim (optimize (compilation-speed 0) (space 0) (debug 0) (safety 0) (speed 3)))
      (ql:quickload "stmx")
      (in-package :stmx.util)

      (defvar *v* (tvar 0))
      (defvar *m*  (new 'rbmap :pred 'fixnum<))
      (defvar *tm* (new 'tmap  :pred 'fixnum<))
      (defvar *h*  (new 'ghash-table :test 'fixnum= :hash 'identity))
      (defvar *th* (new 'thash-table :test 'fixnum= :hash 'identity))

      ;; some initial values
      (setf (get-gmap *m* 1) 0)
      (setf (get-gmap *tm* 1) 0)
      (setf (get-ghash *h* 1) 0)
      (setf (get-ghash *th* 1) 0)

      (defmacro x3 (&rest body)
        `(let ((v *v*)
               (m *m*)
               (tm *tm*)
               (h *h*)
               (th *th*))
           (declare (ignorable v m tm h th))
           (dotimes (,(gensym) 3)
             ,@body)))

      (defmacro 1m (&rest body)
        `(time (dotimes (i 1000000)
                 ,@body)))
- To warm up STMX and the Common Lisp process before starting the benchmarks, it is also recommended to first run the test suite with:

      (ql:quickload "stmx.test")
      (fiveam:run! 'stmx.test:suite)
- Run each benchmark inside an (atomic ...) block one million times (see the 1m macro above) in a single thread. Repeat each run three times (see the x3 macro above) and take the lowest of the three reported elapsed times. Divide by one million to get the average elapsed real time per iteration. For example, to run the benchmark ($ v) one has to type:

      (x3 (1m (atomic ($ v))))
All timings reported in the next section were obtained on the author's system with the procedure just described. For each benchmark they give the average elapsed real time per iteration, i.e. the total elapsed time divided by the number of iterations (one million).
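The arithmetic of the procedure can be sketched in plain Common Lisp. The function name average-microseconds is hypothetical and not part of STMX, and the elapsed times are assumed to be the real-time figures, in seconds, printed by (time ...) for each of the three runs.

```lisp
;; Hypothetical helper, not part of STMX: turn the elapsed real times
;; (in seconds) reported by TIME into an average time per iteration.
(defun average-microseconds (elapsed-seconds &optional (iterations 1000000))
  "Take the lowest of ELAPSED-SECONDS, divide it by ITERATIONS,
and return the result in microseconds."
  (let ((best (reduce #'min elapsed-seconds)))
    ;; seconds per iteration, times one million microseconds per second
    (* 1000000 (/ best iterations))))
```

With made-up timings, (average-microseconds '(0.095 0.089 0.091)) selects the 0.089-second run. Since the default iteration count is one million, the lowest elapsed time in seconds is numerically equal to the average time per iteration in microseconds.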
What follows are some timings obtained on the author's system. They by no means claim to be exact, absolute or reproducible: your mileage may vary.
Date: 12 April 2015
Hardware: Intel Core-i7 4770 @3.4 GHz (quad-core w/ hyper-threading), 16GB RAM
Software: Debian GNU/Linux 7.0 (x86_64), SBCL 1.2.10 (x86_64), STMX 2.0.4
Single-thread benchmarks, executed one million times with (x3 (1m (atomic ...))). All values are average times in microseconds.

name | executed code | STMX sw-only transactions | STMX hybrid hw+sw (requires Intel TSX and 64-bit SBCL) | HAND-OPTIMIZED hw transactions - see doc/benchmark.lisp |
---|---|---|---|---|
nil | nil | 0.071 | 0.023 | 0.012 |
read-1 | ($ v) | 0.089 | 0.023 | 0.014 |
write-1 | (setf ($ v) 1) | 0.113 | 0.027 | 0.026 |
read-write-1 | (incf (the fixnum ($ v))) | 0.138 | 0.027 | 0.024 |
read-write-10 | (dotimes (j 10) (incf (the fixnum ($ v)))) | 0.226 | 0.036 | 0.033 |
read-write-100 | (dotimes (j 100) (incf (the fixnum ($ v)))) | 1.114 | 0.197 | 0.196 |
read-write-1000 | (dotimes (j 1000) (incf (the fixnum ($ v)))) | 9.909 | 1.896 | 1.749 |
read-write-N | best fit of the 3 runs above | (0.142+N*0.0098) | (0.0226+N*0.0036) | (0.0260+N*0.0016) |
orelse empty | (orelse) | 0.042 | 0.027 | 0.022 |
orelse unary | (orelse ($ v)) | 0.219 | 0.263 | N/A |
orelse retry-1 | (orelse (retry) ($ v)) | 0.354 | 0.438 | N/A |
orelse retry-2 | (orelse (retry) (retry) ($ v)) | 0.501 | 0.586 | N/A |
orelse retry-4 | (orelse (retry) (retry) | 0.781 | 0.872 | N/A |
orelse retry-N | best fit of the 3 runs above | (0.248+N*0.178) | (0.308+N*0.182) | |
tmap read-1 | (get-gmap tm 1) | 0.261 | 0.184 | 0.178 |
tmap read-write-1 | (incf (get-gmap tm 1)) | 0.558 | 0.419 | 0.409 |
grow tmap from N to N+1 entries (up to 10) | (when (zerop (mod i 10)) (clear-gmap tm)) | 3.232 | 3.706 | |
grow tmap from N to N+1 entries (up to 100) | (when (zerop (mod i 100)) (clear-gmap tm)) | 4.669 | 5.249 | |
grow tmap from N to N+1 entries (up to 1000) | (when (zerop (mod i 1000)) (clear-gmap tm)) | 5.842 | 6.464 | |
thash read-write-1 | (incf (get-ghash th 1)) | 0.956 | 0.566 | |
grow thash from N to N+1 entries (up to 10) | (when (zerop (mod i 10)) (clear-ghash th)) | 1.669 | 1.738 | |
grow thash from N to N+1 entries (up to 100) | (when (zerop (mod i 100)) (clear-ghash th)) | 1.224 | 1.400 | |
grow thash from N to N+1 entries (up to 1000) | (when (zerop (mod i 1000)) (clear-ghash th)) | 1.174 | 1.325 | |
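The "best fit" rows in the table above express the elapsed time as a linear function of N. The source does not say which fitting method or which data points were used, so the following is only a sketch under the assumption of an ordinary least-squares fit. The function name linear-fit is hypothetical and not part of STMX.

```lisp
;; Hypothetical helper, not part of STMX: ordinary least-squares fit of
;; time = intercept + N * slope over a list of (N . time) pairs.
(defun linear-fit (points)
  "Return two values: the intercept and the slope of the least-squares line."
  (let* ((n (length points))
         (sum-x (reduce #'+ points :key #'car))
         (sum-y (reduce #'+ points :key #'cdr))
         (sum-xx (reduce #'+ points :key (lambda (p) (* (car p) (car p)))))
         (sum-xy (reduce #'+ points :key (lambda (p) (* (car p) (cdr p)))))
         (slope (/ (- (* n sum-xy) (* sum-x sum-y))
                   (- (* n sum-xx) (* sum-x sum-x))))
         (intercept (/ (- sum-y (* slope sum-x)) n)))
    (values intercept slope)))

;; Usage with the sw-only timings of read-write-10, -100 and -1000,
;; assuming these are "the 3 runs above" that the table refers to.
(linear-fit '((10 . 0.226) (100 . 1.114) (1000 . 9.909)))
```

Because both the method and the choice of input points are assumptions, the values returned need not match the published coefficients exactly.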
Concurrent benchmarks on a 4-core CPU. They already iterate ten million times, so do not wrap them in (1m ...).

Dining philosophers, load with:

    (load "stmx/example/dining-philosophers.stmx.lisp")
    (load "stmx/example/dining-philosophers.stmx-hw.lisp")
    (load "stmx/example/dining-philosophers.hw-only.lisp")
    (load "stmx/example/dining-philosophers.lock.lisp")
    (in-package :stmx.example.dining-philosophers.[...])

All values are in millions of transactions per second.

number of threads | executed code | STMX sw-only transactions | STMX hybrid hw+sw | STMX hybrid hw+sw, HAND OPTIMIZED | hw-only, HAND-OPTIMIZED | LOCK (atomic compare-and-swap) | LOCK (bordeaux-threads mutex) |
---|---|---|---|---|---|---|---|
1 thread | (dining-philosophers 1) | 4.511 | 24.45 | 34.97 | 50.00 | 68.97 | 14.64 |
2 threads | (dining-philosophers 2) | 8.150 | 9.46 | 11.48 | 26.44 | 56.92 | 11.43 |
3 threads | (dining-philosophers 3) | 11.86 | 9.67 | 10.40 | 30.33 | 51.62 | 9.48 |
4 threads | (dining-philosophers 4) | 15.48 | 11.83 | 13.84 | 32.05 | 44.98 | 14.24 |
5 threads | (dining-philosophers 5) | 13.72 | 13.05 | 15.16 | 32.38 | 63.13 | 18.59 |
6 threads | (dining-philosophers 6) | 14.79 | 13.94 | 14.86 | 37.48 | 72.46 | 19.14 |
7 threads | (dining-philosophers 7) | 15.00 | 14.39 | 13.25 | 43.48 | 86.63 | 20.92 |
8 threads | (dining-philosophers 8) | 15.79 | 13.59 | 14.15 | 47.90 | 102.11 | 23.55 |
10 threads | (dining-philosophers 10) | 15.24 | 13.96 | 16.59 | 56.18 | 117.10 | 30.24 |
15 threads | (dining-philosophers 15) | 15.43 | 16.28 | 21.54 | 88.94 | 165.20 | 49.68 |
20 threads | (dining-philosophers 20) | 15.55 | 18.59 | 21.12 | 142.20 | 203.77 | 53.89 |
25 threads | (dining-philosophers 25) | | | | 188.54 | | |
30 threads | (dining-philosophers 30) | 15.51 | 15.84 | 16.01 | 211.86 | 235.94 | 57.64 |
35 threads | (dining-philosophers 35) | | | | 260.61 | | |
40 threads | (dining-philosophers 40) | 15.50 | 20.16 | 15.20 | 278.75 | 254.62 | 58.34 |
50 threads | (dining-philosophers 50) | 15.42 | 16.34 | 19.27 | 272.33 | 262.67 | 58.98 |
100 threads | (dining-philosophers 100) | 15.51 | | | 275.22 | 274.80 | |
200 threads | (dining-philosophers 200) | 15.53 | | | 284.21 | 277.47 | |