KV tombstone GC can deadlock and stall all writes to a cluster #3700
The AE run stuck on I/O looks like it's holding the lock and causing the other goroutines to block, so this current trace seems like a symptom of running out of FDs and not necessarily the cause.
Ugh - this is a nasty one. Running another test with instrumentation on each server (periodic …), we can see the following. The leader loop is waiting to commit the barrier, so it's not servicing the tombstone GC channel:
Meanwhile, a GC timer has fired and is holding the tombstone GC lock while waiting to queue something onto the channel above:
But there's a KVS-related FSM operation running that's stuck on the tombstone GC lock:
And that's keeping the barrier from going through, so we have a deadlock. It seems like the simplest fix here would be to prevent the …
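To make the cycle described above concrete, here is a minimal, self-contained Go sketch of the same shape of deadlock. This is illustrative only and not Consul's code (and not the eventual fix); gcLock, expireCh, and barrierDone are hypothetical stand-ins for the tombstone GC lock, the expiration channel serviced by the leader loop, and the Raft barrier commit:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var gcLock sync.Mutex
	expireCh := make(chan string)      // unbuffered: a send blocks until the leader loop reads it
	barrierDone := make(chan struct{}) // closed when the FSM finishes applying pending entries

	// Leader loop: blocked committing a barrier, so it never reaches the loop
	// that would drain expireCh.
	go func() {
		<-barrierDone // stuck: the FSM apply below never completes
		for key := range expireCh {
			fmt.Println("reaping tombstones up to", key)
		}
	}()

	// GC timer: fired, took the tombstone GC lock, and is now stuck queueing
	// work for the leader loop.
	go func() {
		gcLock.Lock()
		defer gcLock.Unlock()
		expireCh <- "expired-index" // stuck: nobody is reading the channel
	}()

	// KVS-related FSM apply: needs the tombstone GC lock, but the timer
	// goroutine holds it, so the barrier never goes through.
	go func() {
		time.Sleep(10 * time.Millisecond) // let the timer grab the lock first
		gcLock.Lock()
		defer gcLock.Unlock()
		close(barrierDone)
	}()

	// With all three goroutines blocked on each other, the Go runtime reports
	// "all goroutines are asleep - deadlock!" and dumps their stacks.
	select {}
}
```

Running this, the runtime's goroutine dump shows one goroutine parked on the channel send, one on the mutex, and one waiting on the barrier, much like the (truncated) traces referenced in the comments above.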
So the /v1/agent/self hang is because Raft was tied up with the issue above - it was a symptom but not the cause.
For anybody like me wondering how to obtain the goroutine stack trace dump as in the issue description, set …
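The setting named in that comment is cut off in the copy above. As a general note rather than a reconstruction of it: Consul has an enable_debug configuration option that exposes the standard Go /debug/pprof HTTP endpoints (where /debug/pprof/goroutine?debug=2 returns a full stack dump), and any Go process can also emit the same dump itself. A minimal sketch of a signal-triggered variant, purely as an illustration:

```go
package main

import (
	"os"
	"os/signal"
	"runtime/pprof"
	"syscall"
	"time"
)

func main() {
	sigCh := make(chan os.Signal, 1)
	signal.Notify(sigCh, syscall.SIGUSR1)

	go func() {
		for range sigCh {
			// debug=2 prints every goroutine with its full stack trace,
			// like the dumps pasted in this issue.
			_ = pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
		}
	}()

	// Stand-in for the process's real work.
	for {
		time.Sleep(time.Second)
	}
}
```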
While testing the Consul 1.0.1 pre-release, we discovered a situation where the Consul agent would hang when resolving /v1/agent/self. In this case it was Nomad hitting it periodically that triggered the behavior.
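A minimal sketch of that kind of periodic probe, assuming the default local agent HTTP address 127.0.0.1:8500 (illustrative only, not Nomad's actual code); the hard client timeout makes a wedged agent surface as an error instead of a stuck caller:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	client := &http.Client{Timeout: 5 * time.Second}
	for {
		resp, err := client.Get("http://127.0.0.1:8500/v1/agent/self")
		if err != nil {
			fmt.Println("agent self check failed (possible hang):", err)
		} else {
			io.Copy(io.Discard, resp.Body) // drain so the connection can be reused
			resp.Body.Close()
			fmt.Println("agent self check ok:", resp.Status)
		}
		time.Sleep(10 * time.Second)
	}
}
```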
We found the following stuck handler:
And it looks like there's a check update stuck here:
And a reap stuck here:
And an AE run stuck here:
Here's the full gist of the stack traces for the agent - https://gist.github.com/slackpad/0e2e55d5d4656f82b458ef719ffcd6c1.