fatal error: concurrent map read and map write #18

Closed
dazoot opened this issue Jun 24, 2021 · 29 comments

dazoot commented Jun 24, 2021

One of our Caddy instances was renewing ~5 certificates and we got this error in the log:

{"level":"info","ts":1624549442.7298193,"logger":"tls.renew","msg":"acquiring lock","identifier":"dom1.eu"}
{"level":"info","ts":1624549442.745405,"logger":"tls.renew","msg":"acquiring lock","identifier":"dom2.eu"}
{"level":"info","ts":1624549442.7781668,"logger":"tls.renew","msg":"acquiring lock","identifier":"dom3.eu"}
{"level":"info","ts":1624549442.7857873,"logger":"tls.renew","msg":"acquiring lock","identifier":"dom4.eu"}
{"level":"info","ts":1624549442.7933903,"logger":"tls.renew","msg":"acquiring lock","identifier":"dom5.eu"}
{"level":"info","ts":1624549443.8955014,"logger":"tls","msg":"served key authentication certificate","server_name":"dom2.eu","challenge":"tls-alpn-01","remote":"18.197.97.115:13872","distributed":true}
{"level":"info","ts":1624549444.227819,"logger":"tls","msg":"served key authentication certificate","server_name":"dom1.eu","challenge":"tls-alpn-01","remote":"18.197.97.115:13964","distributed":true}
{"level":"info","ts":1624549444.2303605,"logger":"tls","msg":"served key authentication certificate","server_name":"dom3.eu","challenge":"tls-alpn-01","remote":"3.120.130.29:35008","distributed":true}
{"level":"info","ts":1624549444.2698698,"logger":"tls","msg":"served key authentication certificate","server_name":"dom2.eu","challenge":"tls-alpn-01","remote":"66.133.109.36:30662","distributed":true}
{"level":"info","ts":1624549444.3561459,"logger":"tls","msg":"served key authentication certificate","server_name":"dom5.eu","challenge":"tls-alpn-01","remote":"18.197.97.115:13990","distributed":true}
{"level":"info","ts":1624549444.4526017,"logger":"tls","msg":"served key authentication certificate","server_name":"dom2.eu","challenge":"tls-alpn-01","remote":"18.116.86.117:42258","distributed":true}
{"level":"info","ts":1624549444.5201306,"logger":"tls","msg":"served key authentication certificate","server_name":"dom4.eu","challenge":"tls-alpn-01","remote":"18.197.97.115:14026","distributed":true}
{"level":"info","ts":1624549444.5312564,"logger":"tls","msg":"served key authentication certificate","server_name":"dom2.eu","challenge":"tls-alpn-01","remote":"34.221.186.243:63866","distributed":true}
{"level":"info","ts":1624549445.189281,"logger":"tls","msg":"served key authentication certificate","server_name":"dom1.eu","challenge":"tls-alpn-01","remote":"18.116.86.117:42442","distributed":true}
{"level":"info","ts":1624549445.2701666,"logger":"tls","msg":"served key authentication certificate","server_name":"dom3.eu","challenge":"tls-alpn-01","remote":"3.142.122.14:16940","distributed":true}
{"level":"info","ts":1624549445.3078008,"logger":"tls","msg":"served key authentication certificate","server_name":"dom4.eu","challenge":"tls-alpn-01","remote":"3.142.122.14:16962","distributed":true}
{"level":"info","ts":1624549445.4730475,"logger":"tls","msg":"served key authentication certificate","server_name":"dom3.eu","challenge":"tls-alpn-01","remote":"66.133.109.36:30916","distributed":true}
{"level":"info","ts":1624549445.698484,"logger":"tls.renew","msg":"lock acquired","identifier":"dom2.eu"}
{"level":"info","ts":1624549445.7244895,"logger":"tls.renew","msg":"certificate appears to have been renewed already","identifier":"dom2.eu","remaining":7772397.275513487}
{"level":"info","ts":1624549445.72452,"logger":"tls.renew","msg":"releasing lock","identifier":"dom2.eu"}

fatal error: concurrent map read and map write

goroutine 942994 [running]:
runtime.throw(0x18a199d, 0x21)
        runtime/panic.go:1117 +0x72 fp=0xc000067d20 sp=0xc000067cf0 pc=0x438652
runtime.mapaccess2_faststr(0x1641de0, 0xc004ab5e90, 0xc005faaae0, 0x1c, 0xc0017b5980, 0xc005024118)
        runtime/map_faststr.go:116 +0x4a5 fp=0xc000067d90 sp=0xc000067d20 pc=0x414505
github.com/pteich/caddy-tlsconsul.ConsulStorage.Unlock(0x0, 0x0, 0xc0043410e0, 0xc001aa2688, 0xc004ab5e90, 0xc004bfb920, 0x19, 0x0, 0x0, 0xa, ...)
        github.com/pteich/caddy-tlsconsul@v1.3.2/storage.go:87 +0x66 fp=0xc000067e70 sp=0xc000067d90 pc=0x14f70a6
github.com/pteich/caddy-tlsconsul.ConsulStorage.Lock.func1(0xc0050240c0, 0xc0015d2120, 0xc005faaae0, 0x1c)
        github.com/pteich/caddy-tlsconsul@v1.3.2/storage.go:75 +0xb8 fp=0xc000067fc0 sp=0xc000067e70 pc=0x14fa3d8
runtime.goexit()
        runtime/asm_amd64.s:1371 +0x1 fp=0xc000067fc8 sp=0xc000067fc0 pc=0x4728a1
created by github.com/pteich/caddy-tlsconsul.ConsulStorage.Lock
        github.com/pteich/caddy-tlsconsul@v1.3.2/storage.go:73 +0x68e

goroutine 1 [select (no cases), 335 minutes]:
github.com/caddyserver/caddy/v2/cmd.cmdRun(0xc00003a240, 0x0, 0x0, 0x0)
        github.com/caddyserver/caddy/v2@v2.4.3/cmd/commandfuncs.go:276 +0x1395
github.com/caddyserver/caddy/v2/cmd.Main()
        github.com/caddyserver/caddy/v2@v2.4.3/cmd/main.go:85 +0x25b
main.main()
        caddy/main.go:15 +0x25

goroutine 9 [select, 335 minutes]:
github.com/caddyserver/certmagic.(*RingBufferRateLimiter).permit(0xc000097180)
        github.com/caddyserver/certmagic@v0.14.0/ratelimiter.go:216 +0xb2
github.com/caddyserver/certmagic.(*RingBufferRateLimiter).loop(0xc000097180)
        github.com/caddyserver/certmagic@v0.14.0/ratelimiter.go:89 +0xa8
created by github.com/caddyserver/certmagic.NewRateLimiter
        github.com/caddyserver/certmagic@v0.14.0/ratelimiter.go:45 +0x148

goroutine 49 [chan receive, 335 minutes]:
github.com/caddyserver/caddy/v2.trapSignalsCrossPlatform.func1()
        github.com/caddyserver/caddy/v2@v2.4.3/sigtrap.go:42 +0x129
created by github.com/caddyserver/caddy/v2.trapSignalsCrossPlatform
        github.com/caddyserver/caddy/v2@v2.4.3/sigtrap.go:37 +0x35

goroutine 50 [chan receive, 335 minutes]:
github.com/caddyserver/caddy/v2.trapSignalsPosix.func1()
        github.com/caddyserver/caddy/v2@v2.4.3/sigtrap_posix.go:34 +0x139
created by github.com/caddyserver/caddy/v2.trapSignalsPosix

The whole Caddy process died.
Any hint as to what could have gone wrong?


dazoot commented Jun 29, 2021

Any hint on this one?
It's occurring daily.


dazoot commented Jun 29, 2021

Now it happened when obtaining a new certificate for the first time:

{"level":"info","ts":1624962005.2610888,"logger":"tls","msg":"served key authentication certificate","server_name":"domain.org","challenge":"tls-alpn-01","remote":"3.142.122.14:55728","distributed":true}
{"level":"info","ts":1624962005.6160824,"logger":"tls","msg":"served key authentication certificate","server_name":"domain.org","challenge":"tls-alpn-01","remote":"34.221.186.243:58126","distributed":true}
{"level":"info","ts":1624962006.940489,"logger":"tls","msg":"served key authentication certificate","server_name":"domain.org","challenge":"tls-alpn-01","remote":"66.133.109.36:56752","distributed":true}
{"level":"info","ts":1624962008.757574,"logger":"tls.obtain","msg":"acquiring lock","identifier":"domain.org"}
{"level":"info","ts":1624962008.8545318,"logger":"tls.obtain","msg":"lock acquired","identifier":"domain.org"}
{"level":"info","ts":1624962008.8653884,"logger":"tls.obtain","msg":"certificate already exists in storage","identifier":"domain.org"}
{"level":"info","ts":1624962008.8654132,"logger":"tls.obtain","msg":"releasing lock","identifier":"domain.org"}
fatal error: concurrent map read and map write

goroutine 1084603 [running]:
runtime.throw(0x18a199d, 0x21)
        runtime/panic.go:1117 +0x72 fp=0xc0009b9520 sp=0xc0009b94f0 pc=0x438652
runtime.mapaccess2_faststr(0x1641de0, 0xc001ca0930, 0xc005394ae0, 0x1c, 0xc001a15e60, 0xc00173f618)
        runtime/map_faststr.go:116 +0x4a5 fp=0xc0009b9590 sp=0xc0009b9520 pc=0x414505
github.com/pteich/caddy-tlsconsul.ConsulStorage.Unlock(0x0, 0x0, 0xc00126b680, 0xc00008e270, 0xc001ca0930, 0xc0011548a0, 0x19, 0x0, 0x0, 0xa, ...)
        github.com/pteich/caddy-tlsconsul@v1.3.2/storage.go:87 +0x66 fp=0xc0009b9670 sp=0xc0009b9590 pc=0x14f70a6
github.com/pteich/caddy-tlsconsul.ConsulStorage.Lock.func1(0xc00173f5c0, 0xc002152d80, 0xc005394ae0, 0x1c)
        github.com/pteich/caddy-tlsconsul@v1.3.2/storage.go:75 +0xb8 fp=0xc0009b97c0 sp=0xc0009b9670 pc=0x14fa3d8
runtime.goexit()
        runtime/asm_amd64.s:1371 +0x1 fp=0xc0009b97c8 sp=0xc0009b97c0 pc=0x4728a1
created by github.com/pteich/caddy-tlsconsul.ConsulStorage.Lock
        github.com/pteich/caddy-tlsconsul@v1.3.2/storage.go:73 +0x68e

goroutine 1 [select (no cases), 2384 minutes]:
github.com/caddyserver/caddy/v2/cmd.cmdRun(0xc0001a97a0, 0x0, 0x0, 0x0)
        github.com/caddyserver/caddy/v2@v2.4.3/cmd/commandfuncs.go:276 +0x1395
github.com/caddyserver/caddy/v2/cmd.Main()
        github.com/caddyserver/caddy/v2@v2.4.3/cmd/main.go:85 +0x25b
main.main()
        caddy/main.go:15 +0x25

goroutine 24 [select, 2384 minutes]:
github.com/caddyserver/certmagic.(*RingBufferRateLimiter).permit(0xc0001b7d60)
        github.com/caddyserver/certmagic@v0.14.0/ratelimiter.go:216 +0xb2
github.com/caddyserver/certmagic.(*RingBufferRateLimiter).loop(0xc0001b7d60)
        github.com/caddyserver/certmagic@v0.14.0/ratelimiter.go:89 +0xa8
created by github.com/caddyserver/certmagic.NewRateLimiter
        github.com/caddyserver/certmagic@v0.14.0/ratelimiter.go:45 +0x148


dazoot commented Jul 1, 2021

@pteich can we sponsor this issue?


pteich commented Jul 1, 2021

Hi @dazoot - I'm really sorry, but I probably missed the issue notification in all the fuss 😢
That should not happen, and I'll have a look at whether it's an issue I can fix with a mutex or whether it's something upstream (Consul lib) related.

@pteich pteich closed this as completed in edb4786 Jul 5, 2021

pteich commented Jul 5, 2021

@dazoot Thanks for reporting this problem. There was indeed a map that was not secured against concurrent reads and writes. This is fixed now. I'll create a new version, v1.3.3, that includes this fix.
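
For reference, a minimal sketch of this kind of fix, guarding a map with a sync.RWMutex; the type and method names here are hypothetical, not the actual caddy-tlsconsul code:

package main

import "sync"

// lockStore guards a plain map with a RWMutex so that concurrent
// readers and writers no longer trigger Go's
// "concurrent map read and map write" fatal error.
type lockStore struct {
	mu    sync.RWMutex
	locks map[string]struct{}
}

func (s *lockStore) has(key string) bool {
	s.mu.RLock()
	defer s.mu.RUnlock()
	_, ok := s.locks[key]
	return ok
}

func (s *lockStore) add(key string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.locks[key] = struct{}{}
}

func (s *lockStore) remove(key string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.locks, key)
}

func main() {
	s := &lockStore{locks: make(map[string]struct{})}
	s.add("issue_cert_example.com")
	_ = s.has("issue_cert_example.com")
	s.remove("issue_cert_example.com")
}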

@pteich pteich self-assigned this Jul 5, 2021
@pteich pteich added the bug label Jul 5, 2021

dazoot commented Jul 6, 2021

It seems the TLS obtain/renew certificate process is now stuck acquiring the lock:

{"level":"info","ts":1625564182.286606,"logger":"tls.obtain","msg":"acquiring lock","identifier":"nl.domain.ro"}


pteich commented Jul 6, 2021

Strange, it worked on my local setup but I'll check it out on a larger installation.

@pteich pteich reopened this Jul 6, 2021
pteich added a commit that referenced this issue Jul 6, 2021

pteich commented Jul 7, 2021

@dazoot I've introduced a fix in another new version, v1.3.4, which I've now had running for over 12 hours and which has successfully received new certificates. I now consider this bug finally resolved.

@pteich pteich closed this as completed Jul 7, 2021

dazoot commented Jul 18, 2021

After a couple of weeks I have noticed that we still have stuck certificates. After checking the nodes, I see that they get stuck waiting for the lock.

{"level":"info","ts":1626567854.1713824,"logger":"tls.renew","msg":"acquiring lock","identifier":"cdn.domain.ro"}

Can this lock acquisition be set to time out after a while?


pteich commented Jul 18, 2021

This is a log message from Caddy (or rather certmagic), but you don't see any errors or messages after it appears?
I probably have to add some debug logs to get an impression of whether the Consul plugin is even called and where it breaks.
By now I have both a local lock, for quickly checking whether the instance already holds the lock (this is the one with the concurrency problems before), and the real distributed Consul lock. I can add a lock wait time for acquiring the Consul lock. Maybe in some situations it just takes too long to get the lock, or Consul is not responsive and everything gets stuck.

I'll add this, maybe as a new config option. And I'll also add debug logs to get more helpful messages in such cases.


dazoot commented Jul 18, 2021

    {"level":"info","ts":1626603361.6669,"logger":"tls.cache.maintenance","msg":"attempting certificate renewal","identifiers":["nl.dom.ro"],"remaining":2588554.333102271}
    {"level":"info","ts":1626603361.858531,"logger":"tls.renew","msg":"acquiring lock","identifier":"nl.dom.ro"}

I have set renew_interval very high (e.g. 15d) on all Caddy nodes except one, so only one node actually performs certificate renewals.

Even with a single node doing the renewals, the lock acquisition never completes. Stuck in limbo :)

{"level":"info","ts":1626596162.0471478,"logger":"tls.cache.maintenance","msg":"certificate expires soon; queuing for renewal","identifiers":["img.dom.ro"],"remaining":2591886.952852568}
{"level":"info","ts":1626596162.0472286,"logger":"tls.cache.maintenance","msg":"attempting certificate renewal","identifiers":["img.dom.ro"],"remaining":2591886.952771762}
{"level":"info","ts":1626596162.3396833,"logger":"tls.renew","msg":"acquiring lock","identifier":"img.dom.ro"}
{"level":"info","ts":1626599762.3353257,"logger":"tls.cache.maintenance","msg":"certificate expires soon; queuing for renewal","identifiers":["img.dom.ro"],"remaining":2588286.66467504}
{"level":"info","ts":1626603361.6667585,"logger":"tls.cache.maintenance","msg":"certificate expires soon; queuing for renewal","identifiers":["img.dom.ro"],"remaining":2584687.333243155}
{"level":"info","ts":1626606961.9950764,"logger":"tls.cache.maintenance","msg":"certificate expires soon; queuing for renewal","identifiers":["img.dom.ro"],"remaining":2581087.004924088}
{"level":"info","ts":1626607212.4525259,"logger":"tls.renew","msg":"acquiring lock","identifier":"img.dom.ro"}


dazoot commented Jul 18, 2021

Is the real distributed Consul lock required?
Maybe we can make it optional.

pteich added a commit that referenced this issue Jul 18, 2021
Try to fix deadlock from #18

pteich commented Jul 18, 2021

To my understanding it is needed in a Caddy cluster to share the lock across instances, so that only one instance can renew or apply for new certificates.

I've changed the code in master to try only once to acquire the lock in Consul and otherwise fail. I think that is enough for this use case. If it gets the lock, everything is OK. If not, another Caddy probably holds it (or an error occurred), and therefore it's fine to return an error. So there is no need to wait forever for it.

The code now also uses the timeout in seconds that can already be configured for lock wait timeouts:

storage consul {
    # ...
    timeout 10
    # ...
}
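
For illustration, a rough sketch of what a try-once acquisition with a bounded wait can look like with the hashicorp/consul/api client (the key name and surrounding structure are assumptions, not the plugin's exact code):

package main

import (
	"fmt"
	"time"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		panic(err)
	}

	// Try the distributed lock once, waiting at most 10 seconds,
	// instead of blocking indefinitely.
	lock, err := client.LockOpts(&api.LockOptions{
		Key:          "caddytls/locks/issue_cert_example.com", // hypothetical key
		LockTryOnce:  true,
		LockWaitTime: 10 * time.Second,
	})
	if err != nil {
		panic(err)
	}

	leaderCh, err := lock.Lock(nil)
	if err != nil {
		panic(err)
	}
	if leaderCh == nil {
		// Another instance holds the lock; fail instead of waiting forever.
		fmt.Println("lock is held elsewhere, giving up")
		return
	}
	defer lock.Unlock()

	fmt.Println("lock acquired, safe to obtain/renew the certificate")
}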


dazoot commented Jul 18, 2021

Great. Can you create a release, please?


pteich commented Jul 18, 2021

Done!

Out of curiosity: how many domains and requests does your cluster roughly serve? I have nearly 2.5 million hits/day with a 4-node Caddy cluster (but only <100 domains) and never ran into similar problems. So thanks for your interesting findings that helped improve things ;)


dazoot commented Jul 18, 2021

We have about 4000 domains. Far fewer requests (mostly link rewriting). About 7 nodes. Will try it out soon. Thanks.


dazoot commented Jul 19, 2021

Still hangs.
We have ~50 certs to renew.
After a restart it renews about 5, then I see a couple of "tls.renew","msg":"acquiring lock" log messages and then nothing.

It does not seem to reach the Consul locking part (e.g. the timeout or the logging you have added).


pteich commented Jul 19, 2021

Strange. I've added some simple debug messages in this version that at least show whether it reaches my code. But I'm not sure how to enable debug logging in Caddy.


dazoot commented Jul 19, 2021

Could it be the local locking, and not the Consul locking, that blocks?


dazoot commented Jul 19, 2021

And I think the sink logging in Caddy uses INFO as the log level. I did not find a way to change it. Can we change it in the module from DEBUG to INFO? In general there should not be that much noise from this module, so INFO would be fine.


pteich commented Jul 19, 2021

I'll change it to info logging. Hopefully we can at least see where it got stuck.


dazoot commented Jul 19, 2021

I did some local tests (changed Debugf to Infof).

It does not get past the local locking. It never reaches the Consul part.

cat /var/log/caddyserver.log | grep caddy.storage | grep -v "loading data from"  | grep scuba
{"level":"info","ts":1626716859.5458074,"logger":"caddy.storage.consul","msg":"trying lock for issue_cert_nl.scubadomain.tld"}

The next step should be an attempt to create the Consul lock:

func (cs *ConsulStorage) Lock(ctx context.Context, key string) error {
	cs.logger.Infof("trying lock for %s", key)

	if _, isLocked := cs.GetLock(key); isLocked {
		return nil
	}

	// prepare the distributed lock
	cs.logger.Infof("creating Consul lock for %s", key)


pteich commented Jul 19, 2021

I'm pretty sure it got stuck because another goroutine still holds the mutex. The cause could be the unlock function.

I've changed the code for this and created a new release. I also switched from debug to info logging, as you did locally.
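
As a minimal sketch of the deadlock pattern in question (hypothetical names, not the plugin's actual code): if the unlock path can return while still holding the mutex, every later lock attempt blocks forever; deferring the release fixes it.

package main

import (
	"fmt"
	"sync"
)

type storage struct {
	mu    sync.Mutex
	locks map[string]struct{}
}

// Buggy: if this returns early while still holding the mutex,
// every later caller blocks forever on mu.Lock().
func (cs *storage) unlockBuggy(key string) error {
	cs.mu.Lock()
	if _, ok := cs.locks[key]; !ok {
		return fmt.Errorf("no lock for %s", key) // BUG: mutex never released
	}
	delete(cs.locks, key)
	cs.mu.Unlock()
	return nil
}

// Fixed: deferring the release guarantees the mutex is freed on every return path.
func (cs *storage) unlockFixed(key string) error {
	cs.mu.Lock()
	defer cs.mu.Unlock()
	if _, ok := cs.locks[key]; !ok {
		return fmt.Errorf("no lock for %s", key)
	}
	delete(cs.locks, key)
	return nil
}

func main() {
	cs := &storage{locks: map[string]struct{}{"example": {}}}
	_ = cs.unlockFixed("example")
	_ = cs.unlockBuggy("missing") // leaves the mutex locked: subsequent calls would deadlock
}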


dazoot commented Jul 19, 2021

New release?


pteich commented Jul 19, 2021

I've already created v1.3.6


dazoot commented Jul 19, 2021

Seems OK now. All certs were renewed, but with ZeroSSL (the fallback). Let's Encrypt was down a while ago.

So now, for some hosts, I have two certs: one from LE, which expires in 20 days, and a fresh one from ZeroSSL.

Is this handled by Caddy or by this module?


pteich commented Jul 19, 2021

This is handled by certmagic inside Caddy and should be no problem (at least I saw the same thing some time ago with some domains). This module just loads and saves the data on request.


dazoot commented Jul 19, 2021

So the expired cert will eventually be deleted?


pteich commented Jul 19, 2021

Exactly, the certmagic Storage interface contains a Delete function for this, and this module implements it.
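
For context, the certmagic Storage interface around v0.14 looks approximately like this (check the certmagic source for the authoritative definition; signatures differ between versions):

// Approximate shape of certmagic's Storage interface around v0.14.
type Storage interface {
	Locker

	// Store puts value at key, Load retrieves it, Delete removes it.
	Store(key string, value []byte) error
	Load(key string) ([]byte, error)
	Delete(key string) error

	Exists(key string) bool
	List(prefix string, recursive bool) ([]string, error)
	Stat(key string) (KeyInfo, error) // KeyInfo is a small metadata struct defined by certmagic
}

type Locker interface {
	Lock(ctx context.Context, key string) error
	Unlock(key string) error
}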

pteich added a commit that referenced this issue Jan 3, 2022
pteich added a commit that referenced this issue Jan 3, 2022
pteich added a commit that referenced this issue Jan 3, 2022
Try to fix deadlock from #18