Fixes handling of stop channel and failed barrier attempts. #3546
There were two issues here. First, we needed to not exit when there was a timeout trying to write the barrier, because Raft might not step down, so we'd be left as the leader but having run all the step down actions. Second, we didn't close over the stopCh correctly, so it was possible to nil that out and have the leaderLoop never exit. We close over it properly AND sequence the nil-ing of it AFTER the leaderLoop exits for good measure, so the code is more robust. Fixes #3545
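The stopCh hazard can be reproduced in isolation. The following is only a sketch with hypothetical stand-ins (`worker`, `runCaptured`, `runPinned` are not Consul code, and the 200 ms timeout exists only so the demo terminates): a goroutine that reads the shared variable after it has been set to nil ends up waiting on a nil channel, which never becomes ready, while a goroutine that receives the channel as an argument still sees the close.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// worker stands in for leaderLoop: it waits until stop is closed.
// The timeout is a demo aid; a real loop would simply block forever.
func worker(stop <-chan struct{}) string {
	select {
	case <-stop:
		return "stopped"
	case <-time.After(200 * time.Millisecond):
		return "timed out"
	}
}

// runCaptured reads the stopCh variable from inside the closure. The
// goroutine is held back until after the variable is set to nil, so it
// ends up waiting on a nil channel and never observes the close.
func runCaptured() string {
	stopCh := make(chan struct{})
	started := make(chan struct{})
	done := make(chan string, 1)
	go func() {
		<-started              // simulate the goroutine being scheduled late
		done <- worker(stopCh) // reads the variable only now
	}()
	close(stopCh)
	stopCh = nil
	close(started)
	return <-done
}

// runPinned passes the channel as an argument, so the goroutine keeps its
// own copy, and clears the variable only after the goroutine has exited.
func runPinned() string {
	var wg sync.WaitGroup
	var result string
	stopCh := make(chan struct{})
	wg.Add(1)
	go func(ch chan struct{}) {
		defer wg.Done()
		result = worker(ch)
	}(stopCh)
	close(stopCh)
	wg.Wait()
	stopCh = nil
	return result
}

func main() {
	fmt.Println("captured variable:", runCaptured())
	fmt.Println("pinned argument:  ", runPinned())
}
```

Either measure alone (passing the channel as an argument, or clearing the variable only after the wait) would be enough in this toy version; the description above says the change does both.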
agent/consul/leader.go
Outdated
var wg sync.WaitGroup
var stopCh chan struct{}
for {
	select {
	case isLeader := <-raftNotifyCh:
		if isLeader {
			if stopCh != nil {
This looks correct. I have three small nitpicks:
- I'd call defer wg.Done() out of habit in case the function becomes more complex
- I'd name the go func argument differently (e.g. ch) to avoid accidental future confusion
- I think writing this as a switch statement may help readability since it gets rid of some brackets
switch {
case isLeader:
	if stopCh != nil {
		s.logger.Printf("[ERR] consul: attempted to start the leader loop while running")
		continue
	}
	stopCh = make(chan struct{})
	wg.Add(1)
	go func(ch chan struct{}) {
		defer wg.Done()
		s.leaderLoop(ch)
	}(stopCh)
	s.logger.Printf("[INFO] consul: cluster leadership acquired")
default:
	if stopCh == nil {
		s.logger.Printf("[ERR] consul: attempted to stop the leader loop while not running")
		continue
	}
	s.logger.Printf("[DEBUG] consul: shutting down leader loop")
	close(stopCh)
	wg.Wait()
	stopCh = nil
	s.logger.Printf("[INFO] consul: cluster leadership lost")
}
Thanks @magiconair these are all good suggestions. PTAL at the last push.
LGTM
We also added a pre-poll before we wait in the leaderLoop, since the exit condition is mixed in with a bunch of other stuff, so if we wait a long time for the barrier we will have hit the interval and may get kind of stuck waiting for the select to pick the stopCh case (making another failed attempt at doing a barrier could cost us another 2 minutes, for example).