Barrier write timeout can cause permanently degraded leader #3545
slackpad added a commit that referenced this issue on Oct 6, 2017:
There were two issues here. First, we needed to not exit when there was a timeout trying to write the barrier, because Raft might not step down, so we'd be left as the leader but having run all the step down actions. Second, we didn't close over the stopCh correctly, so it was possible to nil that out and have the leaderLoop never exit. We close over it properly AND sequence the nil-ing of it AFTER the leaderLoop exits for good measure, so the code is more robust. Fixes #3545
slackpad added a commit that referenced this issue on Oct 6, 2017:
* Fixes handling of stop channel and failed barrier attempts. There were two issues here. First, we needed to not exit when there was a timeout trying to write the barrier, because Raft might not step down, so we'd be left as the leader but having run all the step down actions. Second, we didn't close over the stopCh correctly, so it was possible to nil that out and have the leaderLoop never exit. We close over it properly AND sequence the nil-ing of it AFTER the leaderLoop exits for good measure, so the code is more robust. Fixes #3545
* Cleans up based on code review feedback.
* Tweaks comments.
* Renames variables and removes comments.
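To illustrate the stop-channel part of that commit, here is a minimal sketch, with assumed names (`Server`, `leaderCh`, `monitorLeadership`) rather than Consul's actual code: the leader-loop goroutine closes over its own copy of the stop channel, and the shared variable is only nil-ed after the loop has fully exited.

```go
package leadership

import "sync"

// Server is a stand-in for Consul's server type; the fields and methods here
// are assumptions for illustration, not the real API.
type Server struct {
	leaderCh   <-chan bool     // leadership change notifications from Raft
	shutdownCh <-chan struct{} // closed when the agent shuts down
}

// leaderLoop is a placeholder for the real leader loop; it runs until its
// stop channel is closed.
func (s *Server) leaderLoop(stopCh chan struct{}) {
	<-stopCh
}

// monitorLeadership shows the pattern described in the commit message: the
// goroutine closes over its own copy of the stop channel, and the channel
// variable is cleared only after the leader loop has exited.
func (s *Server) monitorLeadership() {
	var weAreLeaderCh chan struct{}
	var leaderLoopWG sync.WaitGroup
	for {
		select {
		case isLeader := <-s.leaderCh:
			switch {
			case isLeader && weAreLeaderCh == nil:
				weAreLeaderCh = make(chan struct{})
				leaderLoopWG.Add(1)
				go func(ch chan struct{}) {
					defer leaderLoopWG.Done()
					s.leaderLoop(ch) // closes over the local copy, not the shared variable
				}(weAreLeaderCh)
			case !isLeader && weAreLeaderCh != nil:
				close(weAreLeaderCh)
				leaderLoopWG.Wait() // wait for leaderLoop to exit for good...
				weAreLeaderCh = nil // ...and only then clear the variable
			}
		case <-s.shutdownCh:
			return
		}
	}
}
```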
slackpad added a commit to hashicorp/nomad that referenced this issue on Oct 17, 2017:
There was a deadlock issue we fixed under hashicorp/consul#3230, and then discovered an issue with that fix under hashicorp/consul#3545. This PR ports over those fixes, as well as makes the revoke actions only happen if leadership was established. This brings the Nomad leader loop inline with Consul's.
I would like to upgrade my 0.7.5 cluster to version 0.9.3.
Hi @cshabi, we don't have plans to backport this fix, so you'd need to cherry pick it into a local branch and build Consul.
We encountered a user where the Consul leader ran into a network issue that lasted several minutes. Over the span of about 20 minutes there were errors queueing writes, with messages like `consul.coordinate: Batch update failed: timed out enqueuing operation`. Finally, one of these happened during a leader reconcile loop and we saw `consul: failed to wait for barrier: timed out enqueuing operation`.

Since the leader had been established for several days, this hit the `return` statement in the error handling logic here: https://github.com/hashicorp/consul/blob/v1.0.0-beta2/agent/consul/leader.go#L97-L101
Which hit the `defer`-ed `revokeLeadership()` call here: https://github.com/hashicorp/consul/blob/v1.0.0-beta2/agent/consul/leader.go#L111-L116
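For context, here is a rough sketch of the shape of that logic, not the exact code at the links above; the stand-in types, the `barrierWriteTimeout` constant, and the placeholder method bodies are assumptions for illustration:

```go
package leadership

import (
	"log"
	"time"
)

// Stand-ins so the sketch compiles; the real types live in hashicorp/raft and
// Consul's agent/consul package, and the names below are assumptions.
type raftNode interface {
	Barrier(timeout time.Duration) interface{ Error() error }
}

type Server struct {
	raft raftNode
}

func (s *Server) establishLeadership() error { return nil } // placeholder
func (s *Server) revokeLeadership()          {}             // placeholder
func (s *Server) reconcile() error           { return nil } // placeholder

const barrierWriteTimeout = 2 * time.Minute // assumed value

// leaderLoop sketches the shape of the loop at the time of the issue: a
// barrier timeout hits `return`, which fires the deferred revokeLeadership()
// even though Raft may still consider this server the leader.
func (s *Server) leaderLoop(stopCh chan struct{}) {
	var establishedLeader bool
	defer func() {
		if establishedLeader {
			s.revokeLeadership() // resets the consistent-read flag, among other things
		}
	}()

RECONCILE:
	// Barrier to make sure all preceding Raft operations have been applied.
	barrier := s.raft.Barrier(barrierWriteTimeout)
	if err := barrier.Error(); err != nil {
		log.Printf("[ERR] consul: failed to wait for barrier: %v", err)
		return // exits the loop even though we may still be the Raft leader
	}

	if !establishedLeader {
		if err := s.establishLeadership(); err != nil {
			log.Printf("[ERR] consul: failed to establish leadership: %v", err)
			goto WAIT
		}
		establishedLeader = true
	}

	if err := s.reconcile(); err != nil {
		log.Printf("[ERR] consul: failed to reconcile: %v", err)
	}

WAIT:
	select {
	case <-stopCh:
		return // leadership lost or server shutting down
	case <-time.After(60 * time.Second):
		goto RECONCILE
	}
}
```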
The result was that the leader loop exited and the consistent read flag got reset, resulting in the application getting `Not ready to serve consistent reads`. Since the server never actually lost leadership, it was stuck in this state. Restarting the leader fixed it. This was found in 0.9.2, but the code is the same in master today.

The change that added the `return` was #3230, which was working under the assumption that exiting the leader loop would cause the server to step down. In this unfortunate set of circumstances, the network errors were such that we hit the barrier write error, but we were still able to service the Raft requirements to stay leader. This is a bit of a manifestation of the separate path for pipelined Raft replication plus the separate path for AE messages used to maintain leadership (I think), so it stems from our implementation.

In any case, the right thing to do here is to keep the timeout part of the deadlock fix from #3230, but do the old `goto WAIT` instead of the `return`. This would have let the leader make another attempt at the barrier later, and would have kept the leader tracking goroutine.

There's no good API right now to ask Raft to step down, so later we might want to consider adding one; if the leader tracking goroutine hasn't made progress, we could then voluntarily step down. The simple fix outlined above should be sufficient to defer to Raft's health checking, though. There are no other `return` statements of this type in `leaderLoop()`.
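A sketch of what the proposed `goto WAIT` change would look like, reusing the stand-ins from the sketch above (again an illustration, not the actual patch): the barrier failure now falls through to `WAIT` and is retried later, so the deferred `revokeLeadership()` only runs once the stop channel fires.

```go
// leaderLoopFixed sketches the proposed shape: a failed barrier write no
// longer returns (which would run the deferred revokeLeadership()); it falls
// through to WAIT and retries, leaving it to Raft's own health checking to
// decide whether this server should actually stop leading.
func (s *Server) leaderLoopFixed(stopCh chan struct{}) {
	var establishedLeader bool
	defer func() {
		if establishedLeader {
			s.revokeLeadership()
		}
	}()

RECONCILE:
	barrier := s.raft.Barrier(barrierWriteTimeout)
	if err := barrier.Error(); err != nil {
		log.Printf("[ERR] consul: failed to wait for barrier: %v", err)
		goto WAIT // retry the barrier later instead of tearing down leader state
	}

	if !establishedLeader {
		if err := s.establishLeadership(); err != nil {
			log.Printf("[ERR] consul: failed to establish leadership: %v", err)
			goto WAIT
		}
		establishedLeader = true
	}

WAIT:
	select {
	case <-stopCh:
		return // only revoke leadership once we genuinely stop leading
	case <-time.After(60 * time.Second):
		goto RECONCILE // take another run at the barrier and reconcile
	}
}
```

This keeps the bounded barrier timeout from the #3230 deadlock fix while letting the leader retry instead of permanently running its step-down actions.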