fatal error: concurrent map read and map write #18
Comments
Any hint on this one?
Now it happened when getting a new certificate for the first time:
@pteich can we sponsor this issue?
Hi @dazoot - I'm really sorry, but I probably missed the issue notification in all the fuss 😢
@dazoot Thanks for reporting this problem. There was indeed a map that was not protected against concurrent reads and writes. This is fixed now. I'll create a new version, v1.3.3, that includes this fix.
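For readers who hit the same `fatal error: concurrent map read and map write`, here is a minimal sketch of the usual remedy: guarding the map with a `sync.RWMutex`. The type `lockRegistry` and its methods are hypothetical names for illustration only; the thread does not show the module's actual code or how the fix was implemented.

```go
package main

import (
	"fmt"
	"sync"
)

// lockRegistry is a hypothetical stand-in for a map shared between
// goroutines. It is not the module's real type.
type lockRegistry struct {
	mu    sync.RWMutex
	locks map[string]string
}

func newLockRegistry() *lockRegistry {
	return &lockRegistry{locks: make(map[string]string)}
}

// Get takes the read lock, so several readers can run in parallel.
func (r *lockRegistry) Get(key string) (string, bool) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	v, ok := r.locks[key]
	return v, ok
}

// Set takes the write lock, which excludes readers and other writers.
func (r *lockRegistry) Set(key, value string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.locks[key] = value
}

func main() {
	reg := newLockRegistry()
	var wg sync.WaitGroup
	// Many goroutines read and write the same few keys at once.
	for i := 0; i < 50; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			key := fmt.Sprintf("example_key_%d", i%5)
			reg.Set(key, "held")
			reg.Get(key)
		}(i)
	}
	wg.Wait()
	fmt.Println("finished without a concurrent map access error")
}
```

Without the mutex, the Go runtime can abort the whole process when it detects an unsynchronized map write racing with another access, which matches the "whole caddy process died" symptom in the original report.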
It seems now the tls register / renew certificate is stuck in
Strange, it worked on my local setup but I'll check it out on a larger installation.
@dazoot I've introduced a fix in another new version, v1.3.4, which has now been running for over 12 hours and has successfully received new certificates. I now consider this bug finally resolved.
After a couple of weeks I have noticed that we still have stuck certificates. After checking the nodes I see that they get stuck waiting for the lock.
Can this lock acquire be set to time out after a while?
This is a log message from Caddy (or rather certmagic), but you don't see any errors or messages after this appears? I'll add this. Maybe as a new config option. And I'll also add debug logs to get more helpful messages in such cases.
I have set all nodes except one in Caddy with Even when a single node is used, the lock acquire never comes. Stuck in limbo :)
Is the real distributed Consul lock required?
To my understanding it is needed in a Caddy cluster to share the lock across the instances, so that only one instance can renew or apply for new certificates. I've changed the code in master to try just once to acquire the lock in Consul and otherwise fail. I think that is enough for that use case. If it gets the lock, everything is ok. If not, probably another Caddy has it (or an error occurred), and therefore it's ok to return an error. So there is no need to wait forever to get it. The code now also uses the
Great. Can you create a release, please?
Done! Out of curiosity: how many domains and requests does your cluster roughly serve? I have nearly 2.5 million hits/day with a 4-node Caddy cluster (but only <100 domains) and never ran into similar problems. So thanks for your interesting findings, which helped improve things ;)
We have about 4000 domains. Far fewer requests (link rewrite mostly). About 7 nodes. Will try it out soon. Thanks.
Still hangs. Does not seem to reach the
Strange. I've added some simple debug messages in this version that at least show if it reaches my code. But I'm not sure how to enable debug log in Caddy.
Can it be the local locking and not the Consul locking which blocks?
And I think the
I'll change it to info logging. Hopefully we can at least see where it got stuck.
I did some local tests (changed Debugf to Infof). It does not get past the local locking. It does not reach the Consul part.
Next there should be an attempt to create the Consul lock:
I'm pretty sure it got stuck because another process holds the mutex. The reason could be the unlock function. I've changed the code for this and created a new release. I also switched from debug to info as you did locally.
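One common way a local mutex ends up held forever is an early return that skips the unlock. The sketch below shows the defensive pattern of pairing every `Lock` with a deferred `Unlock`. It is an assumption about the class of bug, not a reproduction of the actual change in the release; `store`, `lockKey`, and `unlockKey` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// store is a hypothetical holder of per-key lock state guarded by a mutex.
type store struct {
	mu   sync.Mutex
	held map[string]bool
}

func newStore() *store { return &store{held: make(map[string]bool)} }

// lockKey marks key as held. The deferred Unlock runs on every return path,
// including the early error return, so the mutex can never stay locked.
func (s *store) lockKey(key string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.held[key] {
		return errors.New("already locked: " + key)
	}
	s.held[key] = true
	return nil
}

// unlockKey clears the held mark for key under the same mutex.
func (s *store) unlockKey(key string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.held, key)
}

func main() {
	s := newStore()
	fmt.Println("first lock:", s.lockKey("example.com"))
	// This call takes the early-return path. If the Unlock were not deferred
	// and the error branch forgot to unlock, every later call would block.
	fmt.Println("second lock:", s.lockKey("example.com"))
	s.unlockKey("example.com")
	fmt.Println("after unlock:", s.lockKey("example.com"))
}
```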
New release?
I've already created v1.3.6
Seems ok now. All certs were renewed but with Zerossl (fallback). Letsencrypt was down a while ago. So now I have 2 certs for some hosts. One from LE which expired in 20 days and a fresh one from Zerossl. Is this handled by Caddy or this module?
This is handled by Certmagic inside Caddy and should be no problem (at least I had the same some time ago with some domains). This module just loads and saves the data on request.
So the expired cert will be deleted eventually?
Exactly, the
One of our Caddy instances was renewing ~5 certificates and we got this error in the log:
The whole Caddy process died.
Any hint as to what could have gone wrong?