Barrier write timeout can cause permanently degraded leader #3545
slackpad added a commit that referenced this issue on Oct 6, 2017:
There were two issues here. First, we needed to not exit when there was a timeout trying to write the barrier, because Raft might not step down, so we'd be left as the leader but having run all the step down actions. Second, we didn't close over the stopCh correctly, so it was possible to nil that out and have the leaderLoop never exit. We close over it properly AND sequence the nil-ing of it AFTER the leaderLoop exits for good measure, so the code is more robust. Fixes #3545
slackpad added a commit that referenced this issue on Oct 6, 2017:
* Fixes handling of stop channel and failed barrier attempts. There were two issues here. First, we needed to not exit when there was a timeout trying to write the barrier, because Raft might not step down, so we'd be left as the leader but having run all the step down actions. Second, we didn't close over the stopCh correctly, so it was possible to nil that out and have the leaderLoop never exit. We close over it properly AND sequence the nil-ing of it AFTER the leaderLoop exits for good measure, so the code is more robust. Fixes #3545
* Cleans up based on code review feedback.
* Tweaks comments.
* Renames variables and removes comments.
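To illustrate the stop-channel part of that commit, here is a minimal sketch, with assumed names (`Server`, `leaderCh`, `monitorLeadership`) rather than Consul's actual code: the leader-loop goroutine closes over its own copy of the stop channel, and the shared variable is only nil-ed after the loop has fully exited.

```go
package leadership

import "sync"

// Server is a stand-in for Consul's server type; the fields and methods here
// are assumptions for illustration, not the real API.
type Server struct {
	leaderCh   <-chan bool     // leadership change notifications from Raft
	shutdownCh <-chan struct{} // closed when the agent shuts down
}

// leaderLoop is a placeholder for the real leader loop; it runs until its
// stop channel is closed.
func (s *Server) leaderLoop(stopCh chan struct{}) {
	<-stopCh
}

// monitorLeadership shows the pattern described in the commit message: the
// goroutine closes over its own copy of the stop channel, and the channel
// variable is cleared only after the leader loop has exited.
func (s *Server) monitorLeadership() {
	var weAreLeaderCh chan struct{}
	var leaderLoopWG sync.WaitGroup
	for {
		select {
		case isLeader := <-s.leaderCh:
			switch {
			case isLeader && weAreLeaderCh == nil:
				weAreLeaderCh = make(chan struct{})
				leaderLoopWG.Add(1)
				go func(ch chan struct{}) {
					defer leaderLoopWG.Done()
					s.leaderLoop(ch) // closes over the local copy, not the shared variable
				}(weAreLeaderCh)
			case !isLeader && weAreLeaderCh != nil:
				close(weAreLeaderCh)
				leaderLoopWG.Wait() // wait for leaderLoop to exit for good...
				weAreLeaderCh = nil // ...and only then clear the variable
			}
		case <-s.shutdownCh:
			return
		}
	}
}
```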
slackpad added a commit to hashicorp/nomad that referenced this issue on Oct 17, 2017:
There was a deadlock issue we fixed under hashicorp/consul#3230, and then discovered an issue with that fix under hashicorp/consul#3545. This PR ports over those fixes, as well as makes the revoke actions only happen if leadership was established. This brings the Nomad leader loop inline with Consul's.
I would like to upgrade my 0.7.5 cluster to version 0.9.3.
Hi @cshabi, we don't have plans to backport this fix, so you'd need to cherry pick it into a local branch and build Consul.
We encountered a user where the Consul leader ran into a network issue that lasted several minutes. Over the span of about 20 minutes there were errors queueing writes, with messages like `consul.coordinate: Batch update failed: timed out enqueuing operation`. Finally, one of these happened during a leader reconcile loop and we saw `consul: failed to wait for barrier: timed out enqueuing operation`.

Since the leader had been established for several days, this hit the `return` statement in the error handling logic here: https://github.com/hashicorp/consul/blob/v1.0.0-beta2/agent/consul/leader.go#L97-L101
Which hit the `defer`-ed `revokeLeadership()` call here: https://github.com/hashicorp/consul/blob/v1.0.0-beta2/agent/consul/leader.go#L111-L116
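For context, here is a rough sketch of the shape of that logic, not the exact code at the links above; the stand-in types, the `barrierWriteTimeout` constant, and the placeholder method bodies are assumptions for illustration:

```go
package leadership

import (
	"log"
	"time"
)

// Stand-ins so the sketch compiles; the real types live in hashicorp/raft and
// Consul's agent/consul package, and the names below are assumptions.
type raftNode interface {
	Barrier(timeout time.Duration) interface{ Error() error }
}

type Server struct {
	raft raftNode
}

func (s *Server) establishLeadership() error { return nil } // placeholder
func (s *Server) revokeLeadership()          {}             // placeholder
func (s *Server) reconcile() error           { return nil } // placeholder

const barrierWriteTimeout = 2 * time.Minute // assumed value

// leaderLoop sketches the shape of the loop at the time of the issue: a
// barrier timeout hits `return`, which fires the deferred revokeLeadership()
// even though Raft may still consider this server the leader.
func (s *Server) leaderLoop(stopCh chan struct{}) {
	var establishedLeader bool
	defer func() {
		if establishedLeader {
			s.revokeLeadership() // resets the consistent-read flag, among other things
		}
	}()

RECONCILE:
	// Barrier to make sure all preceding Raft operations have been applied.
	barrier := s.raft.Barrier(barrierWriteTimeout)
	if err := barrier.Error(); err != nil {
		log.Printf("[ERR] consul: failed to wait for barrier: %v", err)
		return // exits the loop even though we may still be the Raft leader
	}

	if !establishedLeader {
		if err := s.establishLeadership(); err != nil {
			log.Printf("[ERR] consul: failed to establish leadership: %v", err)
			goto WAIT
		}
		establishedLeader = true
	}

	if err := s.reconcile(); err != nil {
		log.Printf("[ERR] consul: failed to reconcile: %v", err)
	}

WAIT:
	select {
	case <-stopCh:
		return // leadership lost or server shutting down
	case <-time.After(60 * time.Second):
		goto RECONCILE
	}
}
```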
The result was that the leader loop exited and the consistent read flag got reset, resulting in the application getting `Not ready to serve consistent reads`. Since the server never actually lost leadership, it was stuck in this state. Restarting the leader fixed it. This was found in 0.9.2, but the code is the same in master today.

The change that added the `return` was #3230, which was working under the assumption that exiting the leader loop would cause the server to step down. In this unfortunate set of circumstances, the network errors were such that we hit the barrier write error, but we were still able to service the Raft requirements to stay leader. This is a bit of a manifestation of the separate path for pipelined Raft replication plus the separate path for AE messages used to maintain leadership (I think), so it stems from our implementation.

In any case, the right thing to do here is to keep the timeout part of the deadlock fix from #3230, but do the old `goto WAIT` instead of the `return`. This would have let the leader make another attempt at the barrier later, and would have kept the leader tracking goroutine.

There's no good API right now to ask Raft to step down, so later we might want to consider adding one; if the leader tracking goroutine hasn't made progress, we could then voluntarily step down. The simple fix outlined above should be sufficient to defer to Raft's health checking, though. There are no other `return` statements of this type in `leaderLoop()`.
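A sketch of what the proposed `goto WAIT` change would look like, reusing the stand-ins from the sketch above (again an illustration, not the actual patch): the barrier failure now falls through to `WAIT` and is retried later, so the deferred `revokeLeadership()` only runs once the stop channel fires.

```go
// leaderLoopFixed sketches the proposed shape: a failed barrier write no
// longer returns (which would run the deferred revokeLeadership()); it falls
// through to WAIT and retries, leaving it to Raft's own health checking to
// decide whether this server should actually stop leading.
func (s *Server) leaderLoopFixed(stopCh chan struct{}) {
	var establishedLeader bool
	defer func() {
		if establishedLeader {
			s.revokeLeadership()
		}
	}()

RECONCILE:
	barrier := s.raft.Barrier(barrierWriteTimeout)
	if err := barrier.Error(); err != nil {
		log.Printf("[ERR] consul: failed to wait for barrier: %v", err)
		goto WAIT // retry the barrier later instead of tearing down leader state
	}

	if !establishedLeader {
		if err := s.establishLeadership(); err != nil {
			log.Printf("[ERR] consul: failed to establish leadership: %v", err)
			goto WAIT
		}
		establishedLeader = true
	}

WAIT:
	select {
	case <-stopCh:
		return // only revoke leadership once we genuinely stop leading
	case <-time.After(60 * time.Second):
		goto RECONCILE // take another run at the barrier and reconcile
	}
}
```

This keeps the bounded barrier timeout from the #3230 deadlock fix while letting the leader retry instead of permanently running its step-down actions.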