
Panic in leader loop when stopping autopilot #2896

Closed
slackpad opened this issue Apr 12, 2017 · 2 comments · Fixed by #2911
Labels
type/bug (Feature does not function as expected), type/crash (The issue description contains a golang panic and stack trace)
Milestone 0.8.1
Comments

@slackpad (Contributor)

As seen in https://twitter.com/sLoK69/status/852144938108301312.

[screenshot of the panic output from the linked tweet]

@slackpad added the type/bug and type/crash labels Apr 12, 2017
@slackpad slackpad added this to the 0.8.1 milestone Apr 12, 2017
slok commented Apr 12, 2017

Here is the proper traceback:

2017/04/11 05:13:48 [WARN] memberlist: Refuting a suspect message (from: i-0903fe65a252b119e)
2017/04/11 05:13:48 [INFO] serf: EventMemberJoin: i-0903fe65a252b119e.eu-west-1 10.0.2.132
2017/04/11 05:13:48 [INFO] consul: Handled member-join event for server "i-0903fe65a252b119e.eu-west-1" in area "wan"
2017/04/11 05:13:50 [WARN] raft: Rejecting vote request from 10.0.2.132:8300 since we have a leader: 10.0.1.250:8300
2017/04/11 05:13:51 [INFO] serf: EventMemberJoin: i-0903fe65a252b119e 10.0.2.132
2017/04/11 05:13:51 [INFO] consul: Adding LAN server i-0903fe65a252b119e (Addr: tcp/10.0.2.132:8300) (DC: eu-west-1)
2017/04/11 05:13:51 [INFO] raft: Updating configuration with AddNonvoter (14ecfa36-ddef-46dd-94f8-cabb83fd2ebd, 10.0.2.132:8300) to [{Suffrage:Voter ID:47c692d3-0c0a-4a73-8619-307ed0f568ba Address:10.0.1.253:8300} {Suffrage:Voter ID:b842ce72-59ee-4afb-839f-708001b2e5dd Address:10.0.1.164:8300} {Suffrage:Voter ID:eecb8cdb-c27b-4fbb-8a21-371382e243d4 Address:10.0.1.250:8300} {Suffrage:Voter ID:3b49b8f4-18c7-4495-8173-28923a1bdb57 Address:10.0.2.25:8300} {Suffrage:Nonvoter ID:14ecfa36-ddef-46dd-94f8-cabb83fd2ebd Address:10.0.2.132:8300}]
2017/04/11 05:13:51 [INFO] raft: Added peer 14ecfa36-ddef-46dd-94f8-cabb83fd2ebd, starting replication
2017/04/11 05:13:51 [ERR] raft: peer {Nonvoter 14ecfa36-ddef-46dd-94f8-cabb83fd2ebd 10.0.2.132:8300} has newer term, stopping replication
2017/04/11 05:13:51 [INFO] raft: Node at 10.0.1.250:8300 [Follower] entering Follower state (Leader: "")
2017/04/11 05:13:51 [INFO] consul: cluster leadership lost
2017/04/11 05:13:51 [INFO] raft: aborting pipeline replication to peer {Voter 3b49b8f4-18c7-4495-8173-28923a1bdb57 10.0.2.25:8300}
2017/04/11 05:13:51 [INFO] raft: aborting pipeline replication to peer {Voter 47c692d3-0c0a-4a73-8619-307ed0f568ba 10.0.1.253:8300}
2017/04/11 05:13:51 [INFO] raft: aborting pipeline replication to peer {Voter b842ce72-59ee-4afb-839f-708001b2e5dd 10.0.1.164:8300}
2017/04/11 05:13:51 [ERR] consul: failed to add raft peer: leadership lost while committing log
2017/04/11 05:13:51 [ERR] consul: failed to reconcile member: {i-0903fe65a252b119e 10.0.2.132 8301 map[build:0.8.0:'402636f expect:5 id:14ecfa36-ddef-46dd-94f8-cabb83fd2ebd vsn_min:2 vsn_max:3 raft_vsn:3 vsn:2 wan_join_port:8302 role:consul dc:eu-west-1 port:8300] alive 1 5 2 2 5 4}: leadership lost while committing log
2017/04/11 05:13:55 [WARN] consul.coordinate: Batch update failed: node is not the leader
2017/04/11 05:13:58 [WARN] raft: Heartbeat timeout from "" reached, starting election
2017/04/11 05:13:58 [INFO] raft: Node at 10.0.1.250:8300 [Candidate] entering Candidate state in term 13
2017/04/11 05:13:58 [INFO] raft: Election won. Tally: 3
2017/04/11 05:13:58 [INFO] raft: Node at 10.0.1.250:8300 [Leader] entering Leader state
2017/04/11 05:13:58 [INFO] raft: Added peer 47c692d3-0c0a-4a73-8619-307ed0f568ba, starting replication
2017/04/11 05:13:58 [INFO] raft: Added peer b842ce72-59ee-4afb-839f-708001b2e5dd, starting replication
2017/04/11 05:13:58 [INFO] raft: Added peer 3b49b8f4-18c7-4495-8173-28923a1bdb57, starting replication
2017/04/11 05:13:58 [INFO] raft: Added peer 14ecfa36-ddef-46dd-94f8-cabb83fd2ebd, starting replication
2017/04/11 05:13:58 [INFO] consul: cluster leadership acquired
2017/04/11 05:13:58 [INFO] consul: New leader elected: i-08b5ba5b799fb09d6
2017/04/11 05:13:58 [INFO] raft: pipelining replication to peer {Voter b842ce72-59ee-4afb-839f-708001b2e5dd 10.0.1.164:8300}
2017/04/11 05:13:58 [INFO] raft: pipelining replication to peer {Voter 47c692d3-0c0a-4a73-8619-307ed0f568ba 10.0.1.253:8300}
2017/04/11 05:13:58 [ERR] raft: peer {Nonvoter 14ecfa36-ddef-46dd-94f8-cabb83fd2ebd 10.0.2.132:8300} has newer term, stopping replication
2017/04/11 05:13:58 [INFO] raft: pipelining replication to peer {Voter 3b49b8f4-18c7-4495-8173-28923a1bdb57 10.0.2.25:8300}
2017/04/11 05:13:58 [INFO] raft: Node at 10.0.1.250:8300 [Follower] entering Follower state (Leader: "")
2017/04/11 05:13:58 [INFO] consul: cluster leadership lost
2017/04/11 05:13:58 [INFO] raft: aborting pipeline replication to peer {Voter 47c692d3-0c0a-4a73-8619-307ed0f568ba 10.0.1.253:8300}
2017/04/11 05:13:58 [INFO] raft: aborting pipeline replication to peer {Voter b842ce72-59ee-4afb-839f-708001b2e5dd 10.0.1.164:8300}
2017/04/11 05:13:58 [INFO] raft: aborting pipeline replication to peer {Voter 3b49b8f4-18c7-4495-8173-28923a1bdb57 10.0.2.25:8300}
2017/04/11 05:13:58 [ERR] consul: failed to wait for barrier: leadership lost while committing log
panic: close of closed channel

goroutine 1682202 [running]:
github.com/hashicorp/consul/consul.(*Server).stopAutopilot(0xc4201c3900)
        /gopath/src/github.com/hashicorp/consul/consul/autopilot.go:32 +0x2f
github.com/hashicorp/consul/consul.(*Server).revokeLeadership(0xc4201c3900, 0xf, 0x10)
        /gopath/src/github.com/hashicorp/consul/consul/leader.go:180 +0x106
github.com/hashicorp/consul/consul.(*Server).leaderLoop(0xc4201c3900, 0xc420edd500)
        /gopath/src/github.com/hashicorp/consul/consul/leader.go:109 +0x50f
created by github.com/hashicorp/consul/consul.(*Server).monitorLeadership
        /gopath/src/github.com/hashicorp/consul/consul/leader.go:39 +0x190
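The stack trace shows a classic double-close: `stopAutopilot` closes a shutdown channel, and when `revokeLeadership` runs a second time the second `close` panics. A minimal, hypothetical reconstruction of the shape of the bug and a guard against sequential double-stop (the `autopilot` type and field names here are illustrative, not Consul's actual code):

```go
package main

import "fmt"

// autopilot models the shutdown-channel pattern from the stack trace.
type autopilot struct {
	shutdownCh chan struct{}
}

// stopUnsafe mirrors the buggy shape: closing an already-closed
// channel panics with "close of closed channel".
func (a *autopilot) stopUnsafe() {
	close(a.shutdownCh)
}

// stopSafe makes a *sequential* double stop idempotent by checking
// whether the channel is already closed before closing it.
func (a *autopilot) stopSafe() {
	select {
	case <-a.shutdownCh: // already closed, nothing to do
	default:
		close(a.shutdownCh)
	}
}

func main() {
	a := &autopilot{shutdownCh: make(chan struct{})}
	a.stopSafe()
	a.stopSafe() // no panic on the double stop
	fmt.Println("double stop handled")

	// Reproduce the crash from the stack trace.
	b := &autopilot{shutdownCh: make(chan struct{})}
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("recovered:", r)
		}
	}()
	b.stopUnsafe()
	b.stopUnsafe() // panic: close of closed channel
}
```

Note that the `select` guard only fixes the sequential case: two goroutines can both fall through to `default` and race on `close`, which is exactly why the real fix (below in the thread) serializes the leader loop rather than guarding the close.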

@slackpad (Contributor, Author)

I think the root cause here is that monitorLeadership runs in a different goroutine than leaderLoop, so it's possible we have contention around stopping and starting autopilot. We could add a lock, but even then it seems the old stop could get processed after the new start - will take a more detailed look at this tomorrow after #2897 is merged.
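The point about a lock being insufficient is worth spelling out: a mutex serializes the calls but cannot enforce their *order*. If a stale stop queued by the old leader loop runs after the new loop's start, autopilot ends up stopped even though the node is leader again. A tiny illustrative sketch (not Consul's code):

```go
package main

import (
	"fmt"
	"sync"
)

// guarded models autopilot start/stop protected by a mutex. The lock
// prevents the double-close panic, but it cannot fix misordering.
type guarded struct {
	mu      sync.Mutex
	running bool
}

func (g *guarded) start() { g.mu.Lock(); g.running = true; g.mu.Unlock() }
func (g *guarded) stop()  { g.mu.Lock(); g.running = false; g.mu.Unlock() }

func main() {
	g := &guarded{}
	g.start() // leadership regained: new leader loop starts autopilot
	g.stop()  // stale stop from the OLD leader loop fires late
	// Autopilot is now stopped on a node that is leader - wrong state,
	// even though every call was correctly locked.
	fmt.Println("running after stale stop:", g.running)
}
```

This is why the eventual fix waits for the old leader loop to fully exit before a new one can start, rather than relying on locking alone.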

slackpad added a commit that referenced this issue Apr 13, 2017

Verified: this commit was signed with the committer's verified signature (slackpad, James Phillips).
This fixes #2896 by switching to the `notifyCh` instead of the `leaderCh`,
so we get all up/down events from Raft regarding leadership. We also wait
for the old leader loop to shut down before we ever consider starting a
new one, which keeps that single-threaded and fixes the panic in that issue.
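The serialization described in the commit message can be sketched as follows: the monitor consumes up/down notifications, starts at most one leader loop at a time, and on loss of leadership waits for that loop to exit before it will ever start another. This is a sketch of the pattern under assumed names (`monitorLeadership`, `leaderLoop`, `weAreLeaderCh`), not Consul's actual code:

```go
package main

import (
	"fmt"
	"sync"
)

// monitorLeadership serializes leader-loop lifecycles. On a "gained
// leadership" notification it starts one leaderLoop goroutine; on
// "lost leadership" it signals that loop to stop and WAITS for it to
// exit before any restart. Start/stop is therefore single-threaded,
// so the autopilot stop inside the loop's teardown runs exactly once.
func monitorLeadership(notifyCh <-chan bool) {
	var weAreLeaderCh chan struct{}
	var leaderLoopWG sync.WaitGroup
	for isLeader := range notifyCh {
		if isLeader {
			if weAreLeaderCh != nil {
				continue // already running a leader loop
			}
			weAreLeaderCh = make(chan struct{})
			leaderLoopWG.Add(1)
			go func(stopCh chan struct{}) {
				defer leaderLoopWG.Done()
				leaderLoop(stopCh)
			}(weAreLeaderCh)
		} else {
			if weAreLeaderCh == nil {
				continue // not currently leader
			}
			close(weAreLeaderCh)
			leaderLoopWG.Wait() // old loop fully down before any restart
			weAreLeaderCh = nil
		}
	}
}

// leaderLoop stands in for the real leader duties; it exits when told.
func leaderLoop(stopCh chan struct{}) {
	<-stopCh
	fmt.Println("leader loop exited")
}

func main() {
	notifyCh := make(chan bool)
	done := make(chan struct{})
	go func() { monitorLeadership(notifyCh); close(done) }()
	// Raft flaps: gained, lost, gained, lost - each cycle serialized.
	notifyCh <- true
	notifyCh <- false
	notifyCh <- true
	notifyCh <- false
	close(notifyCh)
	<-done
}
```

Because the monitor consumes every notification (not just "became leader"), rapid flapping like the one in slok's log produces strictly alternating start/stop pairs instead of racing goroutines.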