Nodes locking up #206
Config:
Do you know what build you were running previously?
Are the nodes that lock up totally inoperable? It would be great if you could try to capture the output of sending a
Btw, these only print to stdout and not syslog, which luckily isn't too bad of a problem since upstart on precise logs stdout automatically. Sadly, upstart on lucid doesn't, so I have to use syslog for consistent logging. Metrics and stack trace: https://gist.github.com/tarrant/4d62bd6db4d9434ed8f4
Yeah, they are super debug only, directly to stderr. The
Do you have something fancy going on with syslog? It seems that Consul is blocked writing to syslog. One goroutine is waiting to write to syslog, while holding a lock inside the Logger, so any goroutine that tries to log anything ends up deadlocking.
As far as I can tell, no, nothing fancy. But it makes sense that the issue is in syslog, since that is the reason I upgraded and the only config that I've changed when upgrading. I'll poke around our syslog settings and see if I can discover anything.
Let me know what you find. Otherwise, our only other choices are to either buffer logs indefinitely (bad) or just drop logs (worse).
I can't find anything odd in our syslog setup. I'm attempting to replicate the issue with a small go program and the go-syslog library. |
I'll be trying to replicate tomorrow as well. Hammered a perf cluster for a bit, but never saw this. I wasn't using syslog, however.
We only see this issue in our main cluster; our smaller clusters seem to be happy. This makes me believe it is a locking issue triggered by serf messages about nodes joining/leaving/etc., as this cluster sees significantly more of these messages.
Did you see the same number of serf messages before the update? Or has there been an uptick in the new build? |
@tarrant On thinking about this more, I think a few more pieces are clear:
So, I think the root issue here is the blocking syslog. The other pieces of this make sense.
As a follow-up, it seems we are not alone in this. Golang bug with syslog:
Just updated this: hashicorp/go-syslog@ac3963b. When you get a chance, can you see if this has fixed the issue?
So far that seems to have fixed the issue.
Super annoying that this issue exists, but glad it's fixed!
I just upgraded to commit 6d43b8c and am now seeing a small percentage of nodes constantly failing their serf checks. There is no consistency between which nodes fail out. Restarting Consul resolves the issue, but only temporarily.
Consul is running, but consul members and other API requests hang without responding. There is nothing in the logs beyond messages like: