Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes handling of stop channel and failed barrier attempts. #3546

Merged
merged 4 commits into from
Oct 6, 2017

Conversation

slackpad
Copy link
Contributor

@slackpad slackpad commented Oct 6, 2017

There were two issues here. First, we needed to not exit when there was a timeout trying to write the barrier, because Raft might not step down, so we'd be left as the leader but having run all the step down actions.

Second, we didn't close over the stopCh correctly, so it was possible to nil that out and have the leaderLoop never exit. We close over it properly AND sequence the nil-ing of it AFTER the leaderLoop exits for good measure, so the code is more robust.

We also added a pre-poll before we wait in the leaderLoop, since the exit condition is mixed in with a bunch of other stuff, so if we wait a long time for the barrier we will have hit the interval and may get kind of stuck waiting for the select to pick the stopCh case (making another failed attempt at doing a barrier could cost us another 2 minutes, for example).

Fixes #3545

There were two issues here. First, we needed to not exit when there
was a timeout trying to write the barrier, because Raft might not
step down, so we'd be left as the leader but having run all the step
down actions.

Second, we didn't close over the stopCh correctly, so it was possible
to nil that out and have the leaderLoop never exit. We close over it
properly AND sequence the nil-ing of it AFTER the leaderLoop exits for
good measure, so the code is more robust.

Fixes #3545
var wg sync.WaitGroup
var stopCh chan struct{}
for {
select {
case isLeader := <-raftNotifyCh:
if isLeader {
if stopCh != nil {
Copy link
Contributor

@magiconair magiconair Oct 6, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks correct. I have three small nitpicks:

  1. I'd call defer wg.Done out of habit in case the function becomes more complex
  2. I'd name go func arg different (e.g. ch) to avoid accidental future confusion
  3. I think writing this as a switch statement may help readability since it gets rid of some brackets
switch {
case isLeader:
	if stopCh != nil {
		s.logger.Printf("[ERR] consul: attempted to start the leader loop while running")
		continue
	}

	stopCh = make(chan struct{})
	wg.Add(1)
	go func(ch chan struct{}) {
		defer wg.Done()
		s.leaderLoop(ch)
	}(stopCh)
	s.logger.Printf("[INFO] consul: cluster leadership acquired")

default:
	if stopCh == nil {
		s.logger.Printf("[ERR] consul: attempted to stop the leader loop while not running")
		continue
	}

	s.logger.Printf("[DEBUG] consul: shutting down leader loop")
	close(stopCh)
	wg.Wait()
	stopCh = nil
	s.logger.Printf("[INFO] consul: cluster leadership lost")
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @magiconair these are all good suggestions. PTAL at the last push.

@magiconair magiconair added this to the 1.0 milestone Oct 6, 2017
Copy link
Contributor

@preetapan preetapan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@slackpad slackpad merged commit 4dab70c into master Oct 6, 2017
@slackpad slackpad deleted the issue-3545 branch October 6, 2017 14:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Barrier write timeout can cause permanently degraded leader
3 participants