fatal error: concurrent map read and map write #18

Closed
dazoot opened this issue Jun 24, 2021 · 29 comments

dazoot commented Jun 24, 2021

One of our Caddy instances was renewing ~5 certificates and we got this error in the log:

{"level":"info","ts":1624549442.7298193,"logger":"tls.renew","msg":"acquiring lock","identifier":"dom1.eu"}
{"level":"info","ts":1624549442.745405,"logger":"tls.renew","msg":"acquiring lock","identifier":"dom2.eu"}
{"level":"info","ts":1624549442.7781668,"logger":"tls.renew","msg":"acquiring lock","identifier":"dom3.eu"}
{"level":"info","ts":1624549442.7857873,"logger":"tls.renew","msg":"acquiring lock","identifier":"dom4.eu"}
{"level":"info","ts":1624549442.7933903,"logger":"tls.renew","msg":"acquiring lock","identifier":"dom5.eu"}
{"level":"info","ts":1624549443.8955014,"logger":"tls","msg":"served key authentication certificate","server_name":"dom2.eu","challenge":"tls-alpn-01","remote":"18.197.97.115:13872","distributed":true}
{"level":"info","ts":1624549444.227819,"logger":"tls","msg":"served key authentication certificate","server_name":"dom1.eu","challenge":"tls-alpn-01","remote":"18.197.97.115:13964","distributed":true}
{"level":"info","ts":1624549444.2303605,"logger":"tls","msg":"served key authentication certificate","server_name":"dom3.eu","challenge":"tls-alpn-01","remote":"3.120.130.29:35008","distributed":true}
{"level":"info","ts":1624549444.2698698,"logger":"tls","msg":"served key authentication certificate","server_name":"dom2.eu","challenge":"tls-alpn-01","remote":"66.133.109.36:30662","distributed":true}
{"level":"info","ts":1624549444.3561459,"logger":"tls","msg":"served key authentication certificate","server_name":"dom5.eu","challenge":"tls-alpn-01","remote":"18.197.97.115:13990","distributed":true}
{"level":"info","ts":1624549444.4526017,"logger":"tls","msg":"served key authentication certificate","server_name":"dom2.eu","challenge":"tls-alpn-01","remote":"18.116.86.117:42258","distributed":true}
{"level":"info","ts":1624549444.5201306,"logger":"tls","msg":"served key authentication certificate","server_name":"dom4.eu","challenge":"tls-alpn-01","remote":"18.197.97.115:14026","distributed":true}
{"level":"info","ts":1624549444.5312564,"logger":"tls","msg":"served key authentication certificate","server_name":"dom2.eu","challenge":"tls-alpn-01","remote":"34.221.186.243:63866","distributed":true}
{"level":"info","ts":1624549445.189281,"logger":"tls","msg":"served key authentication certificate","server_name":"dom1.eu","challenge":"tls-alpn-01","remote":"18.116.86.117:42442","distributed":true}
{"level":"info","ts":1624549445.2701666,"logger":"tls","msg":"served key authentication certificate","server_name":"dom3.eu","challenge":"tls-alpn-01","remote":"3.142.122.14:16940","distributed":true}
{"level":"info","ts":1624549445.3078008,"logger":"tls","msg":"served key authentication certificate","server_name":"dom4.eu","challenge":"tls-alpn-01","remote":"3.142.122.14:16962","distributed":true}
{"level":"info","ts":1624549445.4730475,"logger":"tls","msg":"served key authentication certificate","server_name":"dom3.eu","challenge":"tls-alpn-01","remote":"66.133.109.36:30916","distributed":true}
{"level":"info","ts":1624549445.698484,"logger":"tls.renew","msg":"lock acquired","identifier":"dom2.eu"}
{"level":"info","ts":1624549445.7244895,"logger":"tls.renew","msg":"certificate appears to have been renewed already","identifier":"dom2.eu","remaining":7772397.275513487}
{"level":"info","ts":1624549445.72452,"logger":"tls.renew","msg":"releasing lock","identifier":"dom2.eu"}

fatal error: concurrent map read and map write

goroutine 942994 [running]:
runtime.throw(0x18a199d, 0x21)
        runtime/panic.go:1117 +0x72 fp=0xc000067d20 sp=0xc000067cf0 pc=0x438652
runtime.mapaccess2_faststr(0x1641de0, 0xc004ab5e90, 0xc005faaae0, 0x1c, 0xc0017b5980, 0xc005024118)
        runtime/map_faststr.go:116 +0x4a5 fp=0xc000067d90 sp=0xc000067d20 pc=0x414505
github.com/pteich/caddy-tlsconsul.ConsulStorage.Unlock(0x0, 0x0, 0xc0043410e0, 0xc001aa2688, 0xc004ab5e90, 0xc004bfb920, 0x19, 0x0, 0x0, 0xa, ...)
        github.com/pteich/caddy-tlsconsul@v1.3.2/storage.go:87 +0x66 fp=0xc000067e70 sp=0xc000067d90 pc=0x14f70a6
github.com/pteich/caddy-tlsconsul.ConsulStorage.Lock.func1(0xc0050240c0, 0xc0015d2120, 0xc005faaae0, 0x1c)
        github.com/pteich/caddy-tlsconsul@v1.3.2/storage.go:75 +0xb8 fp=0xc000067fc0 sp=0xc000067e70 pc=0x14fa3d8
runtime.goexit()
        runtime/asm_amd64.s:1371 +0x1 fp=0xc000067fc8 sp=0xc000067fc0 pc=0x4728a1
created by github.com/pteich/caddy-tlsconsul.ConsulStorage.Lock
        github.com/pteich/caddy-tlsconsul@v1.3.2/storage.go:73 +0x68e

goroutine 1 [select (no cases), 335 minutes]:
github.com/caddyserver/caddy/v2/cmd.cmdRun(0xc00003a240, 0x0, 0x0, 0x0)
        github.com/caddyserver/caddy/v2@v2.4.3/cmd/commandfuncs.go:276 +0x1395
github.com/caddyserver/caddy/v2/cmd.Main()
        github.com/caddyserver/caddy/v2@v2.4.3/cmd/main.go:85 +0x25b
main.main()
        caddy/main.go:15 +0x25

goroutine 9 [select, 335 minutes]:
github.com/caddyserver/certmagic.(*RingBufferRateLimiter).permit(0xc000097180)
        github.com/caddyserver/certmagic@v0.14.0/ratelimiter.go:216 +0xb2
github.com/caddyserver/certmagic.(*RingBufferRateLimiter).loop(0xc000097180)
        github.com/caddyserver/certmagic@v0.14.0/ratelimiter.go:89 +0xa8
created by github.com/caddyserver/certmagic.NewRateLimiter
        github.com/caddyserver/certmagic@v0.14.0/ratelimiter.go:45 +0x148

goroutine 49 [chan receive, 335 minutes]:
github.com/caddyserver/caddy/v2.trapSignalsCrossPlatform.func1()
        github.com/caddyserver/caddy/v2@v2.4.3/sigtrap.go:42 +0x129
created by github.com/caddyserver/caddy/v2.trapSignalsCrossPlatform
        github.com/caddyserver/caddy/v2@v2.4.3/sigtrap.go:37 +0x35

goroutine 50 [chan receive, 335 minutes]:
github.com/caddyserver/caddy/v2.trapSignalsPosix.func1()
        github.com/caddyserver/caddy/v2@v2.4.3/sigtrap_posix.go:34 +0x139
created by github.com/caddyserver/caddy/v2.trapSignalsPosix

The whole Caddy process died.
Any hint as to what could have gone wrong?


dazoot commented Jun 29, 2021

Any hint on this one?
It's occurring daily.


dazoot commented Jun 29, 2021

Now it happened when obtaining a new certificate for the first time:

{"level":"info","ts":1624962005.2610888,"logger":"tls","msg":"served key authentication certificate","server_name":"domain.org","challenge":"tls-alpn-01","remote":"3.142.122.14:55728","distributed":true}
{"level":"info","ts":1624962005.6160824,"logger":"tls","msg":"served key authentication certificate","server_name":"domain.org","challenge":"tls-alpn-01","remote":"34.221.186.243:58126","distributed":true}
{"level":"info","ts":1624962006.940489,"logger":"tls","msg":"served key authentication certificate","server_name":"domain.org","challenge":"tls-alpn-01","remote":"66.133.109.36:56752","distributed":true}
{"level":"info","ts":1624962008.757574,"logger":"tls.obtain","msg":"acquiring lock","identifier":"domain.org"}
{"level":"info","ts":1624962008.8545318,"logger":"tls.obtain","msg":"lock acquired","identifier":"domain.org"}
{"level":"info","ts":1624962008.8653884,"logger":"tls.obtain","msg":"certificate already exists in storage","identifier":"domain.org"}
{"level":"info","ts":1624962008.8654132,"logger":"tls.obtain","msg":"releasing lock","identifier":"domain.org"}
fatal error: concurrent map read and map write

goroutine 1084603 [running]:
runtime.throw(0x18a199d, 0x21)
        runtime/panic.go:1117 +0x72 fp=0xc0009b9520 sp=0xc0009b94f0 pc=0x438652
runtime.mapaccess2_faststr(0x1641de0, 0xc001ca0930, 0xc005394ae0, 0x1c, 0xc001a15e60, 0xc00173f618)
        runtime/map_faststr.go:116 +0x4a5 fp=0xc0009b9590 sp=0xc0009b9520 pc=0x414505
github.com/pteich/caddy-tlsconsul.ConsulStorage.Unlock(0x0, 0x0, 0xc00126b680, 0xc00008e270, 0xc001ca0930, 0xc0011548a0, 0x19, 0x0, 0x0, 0xa, ...)
        github.com/pteich/caddy-tlsconsul@v1.3.2/storage.go:87 +0x66 fp=0xc0009b9670 sp=0xc0009b9590 pc=0x14f70a6
github.com/pteich/caddy-tlsconsul.ConsulStorage.Lock.func1(0xc00173f5c0, 0xc002152d80, 0xc005394ae0, 0x1c)
        github.com/pteich/caddy-tlsconsul@v1.3.2/storage.go:75 +0xb8 fp=0xc0009b97c0 sp=0xc0009b9670 pc=0x14fa3d8
runtime.goexit()
        runtime/asm_amd64.s:1371 +0x1 fp=0xc0009b97c8 sp=0xc0009b97c0 pc=0x4728a1
created by github.com/pteich/caddy-tlsconsul.ConsulStorage.Lock
        github.com/pteich/caddy-tlsconsul@v1.3.2/storage.go:73 +0x68e

goroutine 1 [select (no cases), 2384 minutes]:
github.com/caddyserver/caddy/v2/cmd.cmdRun(0xc0001a97a0, 0x0, 0x0, 0x0)
        github.com/caddyserver/caddy/v2@v2.4.3/cmd/commandfuncs.go:276 +0x1395
github.com/caddyserver/caddy/v2/cmd.Main()
        github.com/caddyserver/caddy/v2@v2.4.3/cmd/main.go:85 +0x25b
main.main()
        caddy/main.go:15 +0x25

goroutine 24 [select, 2384 minutes]:
github.com/caddyserver/certmagic.(*RingBufferRateLimiter).permit(0xc0001b7d60)
        github.com/caddyserver/certmagic@v0.14.0/ratelimiter.go:216 +0xb2
github.com/caddyserver/certmagic.(*RingBufferRateLimiter).loop(0xc0001b7d60)
        github.com/caddyserver/certmagic@v0.14.0/ratelimiter.go:89 +0xa8
created by github.com/caddyserver/certmagic.NewRateLimiter
        github.com/caddyserver/certmagic@v0.14.0/ratelimiter.go:45 +0x148


dazoot commented Jul 1, 2021

@pteich can we sponsor this issue?


pteich commented Jul 1, 2021

Hi @dazoot - I'm really sorry, but I probably missed the issue notification in all the fuss 😢
That should not happen, and I'll have a look at whether it's an issue I can fix with a mutex or whether it's something upstream (Consul lib) related.

@pteich pteich closed this as completed in edb4786 Jul 5, 2021

pteich commented Jul 5, 2021

@dazoot Thanks for reporting this problem. There was indeed a map that was not secured against concurrent reads and writes. This is fixed now. I'll create a new version, v1.3.3, that includes this fix.
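
For reference, a minimal sketch of this kind of fix, guarding a map with a sync.RWMutex; the type and method names here are hypothetical, not the actual caddy-tlsconsul code:

package main

import "sync"

// lockStore guards a plain map with a RWMutex so that concurrent
// readers and writers no longer trigger Go's
// "concurrent map read and map write" fatal error.
type lockStore struct {
	mu    sync.RWMutex
	locks map[string]struct{}
}

func (s *lockStore) has(key string) bool {
	s.mu.RLock()
	defer s.mu.RUnlock()
	_, ok := s.locks[key]
	return ok
}

func (s *lockStore) add(key string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.locks[key] = struct{}{}
}

func (s *lockStore) remove(key string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.locks, key)
}

func main() {
	s := &lockStore{locks: make(map[string]struct{})}
	s.add("issue_cert_example.com")
	_ = s.has("issue_cert_example.com")
	s.remove("issue_cert_example.com")
}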

@pteich pteich self-assigned this Jul 5, 2021
@pteich pteich added the bug label Jul 5, 2021

dazoot commented Jul 6, 2021

It seems the TLS obtain/renew certificate process is now stuck acquiring the lock:

{"level":"info","ts":1625564182.286606,"logger":"tls.obtain","msg":"acquiring lock","identifier":"nl.domain.ro"}


pteich commented Jul 6, 2021

Strange, it worked on my local setup but I'll check it out on a larger installation.

@pteich pteich reopened this Jul 6, 2021
pteich added a commit that referenced this issue Jul 6, 2021

pteich commented Jul 7, 2021

@dazoot I've introduced a fix in another new version, v1.3.4, which I've now had running for over 12 hours and which has successfully received new certificates. I now consider this bug finally resolved.

@pteich pteich closed this as completed Jul 7, 2021

dazoot commented Jul 18, 2021

After a couple of weeks I have noticed that we still have stuck certificates. After checking the nodes, I see that they get stuck waiting for the lock.

{"level":"info","ts":1626567854.1713824,"logger":"tls.renew","msg":"acquiring lock","identifier":"cdn.domain.ro"}

Can this lock acquisition be set to time out after a while?


pteich commented Jul 18, 2021

This is a log message from Caddy (or rather certmagic), but you don't see any errors or messages after it appears?
I probably have to add some debug logs to get an impression of whether the Consul plugin is even called and where it breaks.
By now I have both a local lock, for quickly checking whether the instance already holds the lock (this is the one with the concurrency problems before), and the real distributed Consul lock. I can add a lock wait time for acquiring the Consul lock. Maybe in some situations it just takes too long to get the lock, or Consul is not responsive and everything gets stuck.

I'll add this, maybe as a new config option. And I'll also add debug logs to get more helpful messages in such cases.


dazoot commented Jul 18, 2021

    {"level":"info","ts":1626603361.6669,"logger":"tls.cache.maintenance","msg":"attempting certificate renewal","identifiers":["nl.dom.ro"],"remaining":2588554.333102271}
    {"level":"info","ts":1626603361.858531,"logger":"tls.renew","msg":"acquiring lock","identifier":"nl.dom.ro"}

I have set renew_interval very high (e.g. 15d) on all Caddy nodes except one, so only one node actually performs certificate renewals.

Even with a single node doing the renewals, the lock acquisition never completes. Stuck in limbo :)

{"level":"info","ts":1626596162.0471478,"logger":"tls.cache.maintenance","msg":"certificate expires soon; queuing for renewal","identifiers":["img.dom.ro"],"remaining":2591886.952852568}
{"level":"info","ts":1626596162.0472286,"logger":"tls.cache.maintenance","msg":"attempting certificate renewal","identifiers":["img.dom.ro"],"remaining":2591886.952771762}
{"level":"info","ts":1626596162.3396833,"logger":"tls.renew","msg":"acquiring lock","identifier":"img.dom.ro"}
{"level":"info","ts":1626599762.3353257,"logger":"tls.cache.maintenance","msg":"certificate expires soon; queuing for renewal","identifiers":["img.dom.ro"],"remaining":2588286.66467504}
{"level":"info","ts":1626603361.6667585,"logger":"tls.cache.maintenance","msg":"certificate expires soon; queuing for renewal","identifiers":["img.dom.ro"],"remaining":2584687.333243155}
{"level":"info","ts":1626606961.9950764,"logger":"tls.cache.maintenance","msg":"certificate expires soon; queuing for renewal","identifiers":["img.dom.ro"],"remaining":2581087.004924088}
{"level":"info","ts":1626607212.4525259,"logger":"tls.renew","msg":"acquiring lock","identifier":"img.dom.ro"}


dazoot commented Jul 18, 2021

Is the real distributed Consul lock required?
Maybe we can make it optional.

pteich added a commit that referenced this issue Jul 18, 2021
Try to fix deadlock from #18

pteich commented Jul 18, 2021

To my understanding it is needed in a Caddy cluster to share the lock across instances, so that only one instance can renew or apply for new certificates.

I've changed the code in master to try only once to acquire the lock in Consul and otherwise fail. I think that is enough for this use case. If it gets the lock, everything is OK. If not, another Caddy probably holds it (or an error occurred), and therefore it's fine to return an error. So there is no need to wait forever for it.

The code now also uses the timeout in seconds that can already be configured for lock wait timeouts:

storage consul {
    # ...
    timeout 10
    # ...
}
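
For illustration, a rough sketch of what a try-once acquisition with a bounded wait can look like with the hashicorp/consul/api client (the key name and surrounding structure are assumptions, not the plugin's exact code):

package main

import (
	"fmt"
	"time"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		panic(err)
	}

	// Try the distributed lock once, waiting at most 10 seconds,
	// instead of blocking indefinitely.
	lock, err := client.LockOpts(&api.LockOptions{
		Key:          "caddytls/locks/issue_cert_example.com", // hypothetical key
		LockTryOnce:  true,
		LockWaitTime: 10 * time.Second,
	})
	if err != nil {
		panic(err)
	}

	leaderCh, err := lock.Lock(nil)
	if err != nil {
		panic(err)
	}
	if leaderCh == nil {
		// Another instance holds the lock; fail instead of waiting forever.
		fmt.Println("lock is held elsewhere, giving up")
		return
	}
	defer lock.Unlock()

	fmt.Println("lock acquired, safe to obtain/renew the certificate")
}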


dazoot commented Jul 18, 2021

Great. Can you create a release, please?


pteich commented Jul 18, 2021

Done!

Out of curiosity: how many domains and requests does your cluster roughly serve? I have nearly 2.5 million hits/day with a 4-node Caddy cluster (but only <100 domains) and never ran into similar problems. So thanks for your interesting findings that helped improve things ;)


dazoot commented Jul 18, 2021

We have about 4000 domains. Far fewer requests (mostly link rewriting). About 7 nodes. Will try it out soon. Thanks.


dazoot commented Jul 19, 2021

Still hangs.
We have ~50 certs to renew.
After a restart it renews about 5, then I see a couple of "tls.renew","msg":"acquiring lock" log messages and then nothing.

It does not seem to reach the Consul locking part (e.g. the timeout or the logging you have added).


pteich commented Jul 19, 2021

Strange. I've added some simple debug messages in this version that at least show whether it reaches my code. But I'm not sure how to enable debug logging in Caddy.


dazoot commented Jul 19, 2021

Could it be the local locking, and not the Consul locking, that blocks?


dazoot commented Jul 19, 2021

And I think the sink logging in Caddy uses INFO as the log level. I did not find a way to change it. Can we change it in the module from DEBUG to INFO? In general there should not be that much noise from this module, so INFO would be fine.


pteich commented Jul 19, 2021

I'll change it to info logging. Hopefully we can at least see where it got stuck.


dazoot commented Jul 19, 2021

I did some local tests (changed Debugf to Infof).

It does not get past the local locking. It never reaches the Consul part.

cat /var/log/caddyserver.log | grep caddy.storage | grep -v "loading data from"  | grep scuba
{"level":"info","ts":1626716859.5458074,"logger":"caddy.storage.consul","msg":"trying lock for issue_cert_nl.scubadomain.tld"}

The next step should be an attempt to create the Consul lock:

func (cs *ConsulStorage) Lock(ctx context.Context, key string) error {
	cs.logger.Infof("trying lock for %s", key)

	if _, isLocked := cs.GetLock(key); isLocked {
		return nil
	}

	// prepare the distributed lock
	cs.logger.Infof("creating Consul lock for %s", key)


pteich commented Jul 19, 2021

I'm pretty sure it got stuck because another goroutine still holds the mutex. The cause could be the unlock function.

I've changed the code for this and created a new release. I also switched from debug to info logging, as you did locally.
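
As a minimal sketch of the deadlock pattern in question (hypothetical names, not the plugin's actual code): if the unlock path can return while still holding the mutex, every later lock attempt blocks forever; deferring the release fixes it.

package main

import (
	"fmt"
	"sync"
)

type storage struct {
	mu    sync.Mutex
	locks map[string]struct{}
}

// Buggy: if this returns early while still holding the mutex,
// every later caller blocks forever on mu.Lock().
func (cs *storage) unlockBuggy(key string) error {
	cs.mu.Lock()
	if _, ok := cs.locks[key]; !ok {
		return fmt.Errorf("no lock for %s", key) // BUG: mutex never released
	}
	delete(cs.locks, key)
	cs.mu.Unlock()
	return nil
}

// Fixed: deferring the release guarantees the mutex is freed on every return path.
func (cs *storage) unlockFixed(key string) error {
	cs.mu.Lock()
	defer cs.mu.Unlock()
	if _, ok := cs.locks[key]; !ok {
		return fmt.Errorf("no lock for %s", key)
	}
	delete(cs.locks, key)
	return nil
}

func main() {
	cs := &storage{locks: map[string]struct{}{"example": {}}}
	_ = cs.unlockFixed("example")
	_ = cs.unlockBuggy("missing") // leaves the mutex locked: subsequent calls would deadlock
}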


dazoot commented Jul 19, 2021

New release?


pteich commented Jul 19, 2021

I've already created v1.3.6


dazoot commented Jul 19, 2021

Seems OK now. All certs were renewed, but with ZeroSSL (the fallback). Let's Encrypt was down a while ago.

So now, for some hosts, I have two certs: one from LE, which expires in 20 days, and a fresh one from ZeroSSL.

Is this handled by Caddy or by this module?


pteich commented Jul 19, 2021

This is handled by certmagic inside Caddy and should be no problem (at least I saw the same thing some time ago with some domains). This module just loads and saves the data on request.


dazoot commented Jul 19, 2021

So the expired cert will eventually be deleted?


pteich commented Jul 19, 2021

Exactly, the certmagic Storage interface contains a Delete function for this, and this module implements it.
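
For context, the certmagic Storage interface around v0.14 looks approximately like this (check the certmagic source for the authoritative definition; signatures differ between versions):

// Approximate shape of certmagic's Storage interface around v0.14.
type Storage interface {
	Locker

	// Store puts value at key, Load retrieves it, Delete removes it.
	Store(key string, value []byte) error
	Load(key string) ([]byte, error)
	Delete(key string) error

	Exists(key string) bool
	List(prefix string, recursive bool) ([]string, error)
	Stat(key string) (KeyInfo, error) // KeyInfo is a small metadata struct defined by certmagic
}

type Locker interface {
	Lock(ctx context.Context, key string) error
	Unlock(key string) error
}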

pteich added a commit that referenced this issue Jan 3, 2022
pteich added a commit that referenced this issue Jan 3, 2022
pteich added a commit that referenced this issue Jan 3, 2022
Try to fix deadlock from #18