sync: RWMutex scales poorly with CPU count #17973

bcmills · 2016-11-18T00:38:47Z

On a machine with many cores, the performance of sync.RWMutex.R{Lock,Unlock} degrades dramatically as GOMAXPROCS increases.

This test program:

package benchmarks_test

import (
	"fmt"
	"sync"
	"testing"
)

func BenchmarkRWMutex(b *testing.B) {
	for ng := 1; ng <= 256; ng <<= 2 {
		b.Run(fmt.Sprint(ng), func(b *testing.B) {
			var mu sync.RWMutex
			mu.Lock()

			var wg sync.WaitGroup
			wg.Add(ng)

			n := b.N
			quota := n / ng

			for g := ng; g > 0; g-- {
				if g == 1 {
					quota = n
				}

				go func(quota int) {
					for i := 0; i < quota; i++ {
						mu.RLock()
						mu.RUnlock()
					}
					wg.Done()
				}(quota)

				n -= quota
			}

			if n != 0 {
				b.Fatalf("Incorrect quota assignments: %v remaining", n)
			}

			b.StartTimer()
			mu.Unlock()
			wg.Wait()
			b.StopTimer()
		})
	}
}

degrades by a factor of 8x as it saturates threads and cores, presumably due to cache contention on &rw.readerCount:

# ./benchmarks.test -test.bench . -test.cpu 1,4,16,64
testing: warning: no tests to run
BenchmarkRWMutex/1      20000000                72.6 ns/op
BenchmarkRWMutex/1-4    20000000                72.4 ns/op
BenchmarkRWMutex/1-16   20000000                72.8 ns/op
BenchmarkRWMutex/1-64   20000000                72.5 ns/op
BenchmarkRWMutex/4      20000000                72.6 ns/op
BenchmarkRWMutex/4-4    20000000               105 ns/op
BenchmarkRWMutex/4-16   10000000               130 ns/op
BenchmarkRWMutex/4-64   20000000               160 ns/op
BenchmarkRWMutex/16     20000000                72.4 ns/op
BenchmarkRWMutex/16-4   10000000               125 ns/op
BenchmarkRWMutex/16-16  10000000               263 ns/op
BenchmarkRWMutex/16-64   5000000               287 ns/op
BenchmarkRWMutex/64     20000000                72.6 ns/op
BenchmarkRWMutex/64-4   10000000               137 ns/op
BenchmarkRWMutex/64-16   5000000               306 ns/op
BenchmarkRWMutex/64-64   3000000               517 ns/op
BenchmarkRWMutex/256                    20000000                72.4 ns/op
BenchmarkRWMutex/256-4                  20000000               137 ns/op
BenchmarkRWMutex/256-16                  5000000               280 ns/op
BenchmarkRWMutex/256-64                  3000000               602 ns/op
PASS

A "control" test, calling a no-op function instead of RWMutex methods, displays no such degradation: the problem does not appear to be due to runtime scheduling overhead.

The text was updated successfully, but these errors were encountered:

josharian · 2016-11-18T04:44:25Z

Possibly of interest for this: http://people.csail.mit.edu/mareko/spaa09-scalablerwlocks.pdf

josharian · 2016-11-18T04:44:53Z

cc @dvyukov

ianlancetaylor · 2016-11-18T14:39:37Z

It may be difficult to apply the algorithm described in that paper to our existing sync.RWMutex type. The algorithm requires an association between the read-lock operation and the read-unlock operation. It can be implemented by having read-lock/read-unlock always occur on the same thread or goroutine, or by having the read-lock operation return a pointer that is passed to the read-unlock operation. Basically the algorithm builds a tree to avoid contention, and requires each read-lock/read-unlock pair to operate on the same node of the tree.

It would be feasible to implement the algorithm as part of a new type, in which the read-lock operation returned a pointer to be passed to the read-unlock operation. I think that new type could be implemented entirely in terms of sync/atomic functions, sync.Mutex, and sync.Cond. That is, it doesn't seem to require any special relationship with the runtime package.

dvyukov · 2016-11-18T14:50:59Z

Locking per-P slot may be enough and is much simpler:
https://codereview.appspot.com/4850045/diff2/1:3001/src/pkg/co/distributedrwmutex.go

ianlancetaylor · 2016-11-18T14:53:52Z

What happens when a goroutine moves to a different P between read-lock and read-unlock?

dvyukov · 2016-11-18T14:57:38Z

RLock must return a proxy object to unlock, that object must hold the locked P index.

bcmills · 2016-11-18T17:42:12Z

The existing RWMutex API allows the RLock call to occur on a different goroutine from the RUnlock call. We can certainly assume that most RLock / RUnlock pairs occur on the same goroutine (and optimize for that case), but I think there needs to be a slow-path fallback for the general case.

(For example, you could envision an algorithm that attempts to unlock the slot for the current P, then falls back to a linear scan if the current P's slot wasn't already locked.)

bcmills · 2016-11-18T18:10:17Z

At any rate: general application code can work around the problem (in part) by using per-goroutine or per-goroutine-pool caches rather than global caches shared throughout the process.

The bigger issue is that sync.RWMutex is used fairly extensively within the standard library for package-level locks (the various caches in reflect, http.statusMu, json.encoderCache, mime.mimeLock, etc.), so it's easy for programs to fall into contention traps and hard to apply workarounds without avoiding large portions of the standard library. For those use-cases, it might actually be feasible to switch to something with a different API (such as having RLock return an unlocker).

dvyukov · 2016-11-18T18:17:34Z

For these cases in std lib atomic.Value is much better fit. It is already used in json, gob and http. atomic.Value is perfectly scalable and virtually zero overhead for readers.

bcmills · 2016-11-18T18:27:38Z

I agree in general, but it's not obvious to me how one could use atomic.Value to guard lookups in a map acting as a cache. (It's perfect for maps which do not change, but how would you add new entries to the caches with that approach?)

dvyukov · 2016-11-18T18:41:16Z

What kind of cache do you mean?

ianlancetaylor · 2016-11-18T18:46:45Z

E.g., the one in reflect.ptrTo.

dvyukov · 2016-11-18T18:52:47Z

See e.g. encoding/json/encode.go:cachedTypeFields

bcmills · 2016-11-18T19:17:11Z

Hmm... that trades a higher allocation rate (and O(N) insertion cost) in exchange for getting the lock out of the reader path. (It basically pushes the "read lock" out to the garbage collector.)

And since most of these maps are insert-only (never deleted from), you can at least suspect that the O(N) insert won't be a huge drag: if there were many inserts, the maps would end up enormously large.

It would be interesting to see whether the latency tradeoff favors the RWMutex overhead or the O(N) insert overhead for more of the standard library.

ianlancetaylor · 2016-11-18T19:27:37Z

@dvyukov Thanks. For the reflect.ptrTo case I wrote it up as https://golang.org/cl/33411. It needs some realistic benchmarks--microbenchmarks won't prove anything one way or another.

josharian · 2016-11-19T19:27:50Z

It would be feasible to implement the algorithm as part of a new type, in which the read-lock operation returned a pointer to be passed to the read-unlock operation. I think that new type could be implemented entirely in terms of sync/atomic functions, sync.Mutex, and sync.Cond.

Indeed. Seems like it might be worth an experiment. If nothing else, might end up being useful at go4.org.

For these cases in std lib atomic.Value is much better fit.

Not always. atomic.Value can end up being a lot more code and complication. See CL 2641 for a worked example. For low level performance critical things like reflect, I'm all for atomic.Value, but much of the rest of the standard library, it'd be nice to fix the scalability of RWMutex (or have a comparably easy to use alternative).

dvyukov · 2016-11-21T08:41:29Z

Note that in most of these cases insertions happen very, very infrequently. Only during server warmup when it receives a first request of a new type or something. While reads happen all the time. Also, no matter how scalable RWMutex is, it still blocks all readers during updates increasing latency and causing large overheads for blocking/unblocking.

For the reflect.ptrTo case I wrote it up as https://golang.org/cl/33411. It needs some realistic benchmarks--microbenchmarks won't prove anything one way or another.

Just benchmarked it on realistic benchmarks in my head. It is good :)

dvyukov · 2016-11-21T08:46:18Z

See CL 2641 for a worked example.

I would not say that it is radically more code and complication. Provided that one does it right the first time, rather than do it non-scalable first and then refactor everything.

gopherbot · 2016-11-21T21:05:29Z

CL https://golang.org/cl/33411 mentions this issue.

bcmills · 2016-12-01T23:13:19Z

https://go-review.googlesource.com/#/c/33852/ has a draft for a more general API for maps of the sort used in the standard library; should I send that for review? (I'd put it in the x/sync repo for now so we can do some practical experiments.)

davidlazar · 2016-12-02T22:59:01Z

@jonhoo built a more scalable RWMutex here: https://github.com/jonhoo/drwmutex/

minux · 2016-12-03T00:57:48Z

One problem of https://github.com/jonhoo/drwmutex/ is that it doesn't handle the case when the user increases GOMAXPROCS at runtime (because New() just allocates a static slice of Mutexes.)

bcmills · 2016-12-03T04:43:18Z

@minux That would be easy enough to fix by using runtime.NumCPU() instead.

minux · 2016-12-03T06:23:15Z

Sure, but it wastes significant amount of memory when the system has a lot processors. Real per-cpu mutex should tie to P.

jonhoo · 2016-12-03T18:21:41Z

@minux I did at some point have a benchmark running on a modified version of Go that used P instead of CPUID. Unfortunately, I can't find that code any more, but from memory it got strictly worse performance than the CPUID-based solution. The situation could of course be different now though.

gopherbot · 2020-01-19T08:02:21Z

Change https://golang.org/cl/215359 mentions this issue: sync: add benchmark for issue 17973.

gopherbot · 2020-01-19T08:02:23Z

Change https://golang.org/cl/215362 mentions this issue: sync: Implement a procLocal abstraction.

balasanjay · 2020-02-13T20:45:53Z

Update: the two problems with RWMutex is that it has poor multi-core scalability (this issue) and that its fairly big (takes up 40% of a cache-line on its own). I originally decided to look at multi-core scalability first, but upon reflection, it makes more sense to tackle these problems in the other order. I intend to return to this issue once #37142 is resolved.

puzpuzpuz · 2021-08-25T18:16:50Z

There is one more possible approach to making RWMutex more scalable for readers - BRAVO (Biased Locking for Reader-Writer Locks) algorithm:
https://github.com/puzpuzpuz/xsync#rbmutex

It may be seen as a variation of D.Vyukov's DistributedRWMutex. Yet, the implementation is different since it wraps a single RWMutex instance and uses an array of reader slots to distribute the RLock attempts. It also returns a proxy object to the readers and internally uses a sync.Pool to piggyback on its thread-local behavior (obviously, that's a user-land workaround, not something mandatory).

As you'd expect, reader acquires scale better at the cost of more expensive writer locks.

I'm posting this for the sake of listing all possible approaches.

see issue golang/go#17973

An experiment using drwmutex [1] to speed up read lock contention on 96 vCPUs, as observed in [2]. The final run of `kv95/enc=false/nodes=3/cpu=96` exhibited average throughput of 173413 ops/sec. That's worse than the implementation without RWMutex. It appears that read lock, as implemented by Go's runtime scales poorly to a high number of vCPUs [3]. On the other hand, the write lock under drwmutex requires acquiring 96 locks in this case, which appears to be the only bottleneck; the sharded read lock is optimal enough that it doesn't show up on the cpu profile. The only slow down appears to be the write lock inside getStatsForStmtWithKeySlow which is unavoidable. Although inconclusive, it appears that drwmutex doesn't scale well above a certain number of vCPUs, when the write mutex is on a critical path. [1] https://github.com/jonhoo/drwmutex [2] cockroachdb#109443 [3] golang/go#17973 Epic: none Release note: None

ntsd · 2023-12-22T01:53:30Z

I made a benchmark, But not sure if the code is correct. Hope this helps.

for 10 concurrent 100k iters per each, RWMutex is fine in write, and slightly better in read.
for 100 concurrent 10k iters per each, RWMutex is slower in write, and impressive in read.
What surprises me is sync.Map did better in more concurrency.

https://github.com/ntsd/go-mutex-comparison?tab=readme-ov-file#test-scenarios

quentinmit added the NeedsFix The path to resolution is known, but the work has not been done. label Nov 18, 2016

quentinmit added this to the Go1.8Maybe milestone Nov 18, 2016

rsc modified the milestones: Go1.9Early, Go1.8Maybe Nov 21, 2016

bcmills mentioned this issue Dec 3, 2016

sync: add a Map to replace RWLock+map usage #18177

Closed

bcmills mentioned this issue Feb 13, 2020

sync: shrink types in sync package #37142

Open

bcmills mentioned this issue May 4, 2020

sync: RWMutex.RLock is 2x slower than Mutex.Lock #38813

Closed

lalonde mentioned this issue Jun 12, 2020

Change Mutex to RWMutex for connectivityStateManager grpc/grpc-go#3682

Closed

sbunce mentioned this issue Nov 9, 2020

memmetrics: simplify locking and solve data race vulcand/oxy#209

Open

bcmills mentioned this issue Nov 13, 2020

context: cancelCtx exclusive lock causes extreme contention #42564

Closed

amyangfei mentioned this issue Feb 23, 2021

kv-client: use multiple sync.Map to reduce lock contention pingcap/tiflow#1439

Merged

ti-srebot mentioned this issue Mar 2, 2021

kv-client: use multiple sync.Map to reduce lock contention (#1439) pingcap/tiflow#1476

Merged

trozet mentioned this issue Jun 15, 2021

policy: fix gressPolicy data race on delete ovn-kubernetes/ovn-kubernetes#2259

Merged

aojea mentioned this issue Jun 16, 2021

implement UserIdleTimeout moby/spdystream#86

Closed

simar7 mentioned this issue Jul 26, 2021

tracee-rules: evaluate parsed input with OPA aquasecurity/tracee#829

Merged

This was referenced Nov 19, 2021

Replace Mutex with RWMutex in cloud provider to improve efficiency kubernetes/kubernetes#106555

Closed

Fix missing of RLock in SeenAllSources kubernetes/kubernetes#106551

Merged

MrAlias mentioned this issue Nov 23, 2021

Use the grpc.ClientConn to handle connections for the otlptracegrpc client open-telemetry/opentelemetry-go#2329

Merged

ngergs added a commit to ngergs/websrv that referenced this issue Mar 3, 2022

fix: use sync.Map instead of sync.RWMutex

7b17d31

see issue golang/go#17973

gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jul 7, 2022

kolesnikovae mentioned this issue Jul 25, 2022

feat: concurrent storage put grafana/pyroscope#1304

Merged

6 tasks

corhere mentioned this issue Jul 26, 2022

Refactor libcontainerd to minimize containerd RPCs moby/moby#43564

Merged

sumeerbhola mentioned this issue Oct 5, 2022

cache: investigate mutex contention with TPCC cockroachdb/pebble#1997

Closed

srosenberg mentioned this issue Aug 28, 2023

kv95/enc=false/nodes=3/cpu=96 regression on July 17, 2023 cockroachdb/cockroach#109443

Closed

qiulaidongfeng mentioned this issue Jan 12, 2024

proposal: sync: support for sharded values #18802

Open

Jacalz mentioned this issue Jan 12, 2024

Consider moving from sync.RWMutex to sync.Mutex fyne-io/fyne#4536

Open

2 tasks

mxpv mentioned this issue Sep 29, 2024

internal/nri: use RWMutex to speed up readonly request containerd/containerd#10748

Closed

This was referenced Oct 27, 2024

Use atomic pointer for map access synchronization tg123/go-htpasswd#12

Merged

Use atomic pointer for map access synchronization Snawoot/go-htpasswd#1

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sync: RWMutex scales poorly with CPU count #17973

sync: RWMutex scales poorly with CPU count #17973

bcmills commented Nov 18, 2016 •

edited

Loading

josharian commented Nov 18, 2016

josharian commented Nov 18, 2016

ianlancetaylor commented Nov 18, 2016

dvyukov commented Nov 18, 2016

ianlancetaylor commented Nov 18, 2016

dvyukov commented Nov 18, 2016

bcmills commented Nov 18, 2016

bcmills commented Nov 18, 2016 •

edited

Loading

dvyukov commented Nov 18, 2016

bcmills commented Nov 18, 2016

dvyukov commented Nov 18, 2016

ianlancetaylor commented Nov 18, 2016

dvyukov commented Nov 18, 2016

bcmills commented Nov 18, 2016 •

edited

Loading

ianlancetaylor commented Nov 18, 2016

josharian commented Nov 19, 2016

dvyukov commented Nov 21, 2016

dvyukov commented Nov 21, 2016

gopherbot commented Nov 21, 2016

bcmills commented Dec 1, 2016

davidlazar commented Dec 2, 2016

minux commented Dec 3, 2016 via email

bcmills commented Dec 3, 2016

minux commented Dec 3, 2016 via email

jonhoo commented Dec 3, 2016

gopherbot commented Jan 19, 2020

gopherbot commented Jan 19, 2020

balasanjay commented Feb 13, 2020

puzpuzpuz commented Aug 25, 2021

ntsd commented Dec 22, 2023 •

edited

Loading

sync: RWMutex scales poorly with CPU count #17973

sync: RWMutex scales poorly with CPU count #17973

Comments

bcmills commented Nov 18, 2016 • edited Loading

josharian commented Nov 18, 2016

josharian commented Nov 18, 2016

ianlancetaylor commented Nov 18, 2016

dvyukov commented Nov 18, 2016

ianlancetaylor commented Nov 18, 2016

dvyukov commented Nov 18, 2016

bcmills commented Nov 18, 2016

bcmills commented Nov 18, 2016 • edited Loading

dvyukov commented Nov 18, 2016

bcmills commented Nov 18, 2016

dvyukov commented Nov 18, 2016

ianlancetaylor commented Nov 18, 2016

dvyukov commented Nov 18, 2016

bcmills commented Nov 18, 2016 • edited Loading

ianlancetaylor commented Nov 18, 2016

josharian commented Nov 19, 2016

dvyukov commented Nov 21, 2016

dvyukov commented Nov 21, 2016

gopherbot commented Nov 21, 2016

bcmills commented Dec 1, 2016

davidlazar commented Dec 2, 2016

minux commented Dec 3, 2016 via email

bcmills commented Dec 3, 2016

minux commented Dec 3, 2016 via email

jonhoo commented Dec 3, 2016

gopherbot commented Jan 19, 2020

gopherbot commented Jan 19, 2020

balasanjay commented Feb 13, 2020

puzpuzpuz commented Aug 25, 2021

ntsd commented Dec 22, 2023 • edited Loading

bcmills commented Nov 18, 2016 •

edited

Loading

bcmills commented Nov 18, 2016 •

edited

Loading

bcmills commented Nov 18, 2016 •

edited

Loading

ntsd commented Dec 22, 2023 •

edited

Loading