
Panic in leader loop when stopping autopilot #2980

Closed
Gufran opened this issue Apr 28, 2017 · 6 comments
Assignees: slackpad, magiconair
Labels: type/bug (Feature does not function as expected), type/crash (The issue description contains a golang panic and stack trace)

Comments


Gufran commented Apr 28, 2017

consul version for both Client and Server

Client: v0.8.1
Server: v0.8.1

consul info for both Client and Server

A client is not involved.

Server:

agent:
        check_monitors = 0
        check_ttls = 0
        checks = 0
        services = 1
build:
        prerelease =
        revision = 'e9ca44d
        version = 0.8.1
consul:
        bootstrap = false
        known_datacenters = 2
        leader = false
        leader_addr = 10.0.13.24:8300
        server = true
raft:
        applied_index = 482776383
        commit_index = 482776383
        fsm_pending = 0
        last_contact = 48.941638ms
        last_log_index = 482776386
        last_log_term = 2709
        last_snapshot_index = 482775776
        last_snapshot_term = 2709
        latest_configuration = [{Suffrage:Voter ID:10.0.13.24:8300 Address:10.0.13.24:8300} {Suffrage:Voter ID:10.0.13.161:8300 Address:10.0.13.161:8300} {Suffrage:Voter ID:10.0.13.84:8300 Address:10.0.13.84:8300} {Suffrage:Voter ID:10.0.13.231:8300 Address:10.0.13.231:8300} {Suffrage:Voter ID:10.0.13.52:8300 Address:10.0.13.52:8300}]
        latest_configuration_index = 482581927
        num_peers = 4
        protocol_version = 2
        protocol_version_max = 3
        protocol_version_min = 0
        snapshot_version_max = 1
        snapshot_version_min = 0
        state = Follower
        term = 2709
runtime:
        arch = amd64
        cpu_count = 8
        goroutines = 204
        max_procs = 2
        os = linux
        version = go1.8.1
serf_lan:
        encrypted = false
        event_queue = 0
        event_time = 400
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 1986
        members = 34
        query_queue = 0
        query_time = 1
serf_wan:
        encrypted = false
        event_queue = 0
        event_time = 1
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 571
        members = 8
        query_queue = 0

Operating system and Environment details

NAME="Amazon Linux AMI"
VERSION="2017.03"
ID="amzn"
ID_LIKE="rhel fedora"
VERSION_ID="2017.03"
PRETTY_NAME="Amazon Linux AMI 2017.03"
CPE_NAME="cpe:/o:amazon:linux:2017.03:ga"

Description of the Issue (and unexpected/desired result)

This is the same as #2896.
The leader node panics when trying to stop autopilot.
The fix in #2911 was merged into the 0.8.1 release, so I'm surprised to see this panic.

Log Fragments or Link to gist

    2017/04/28 13:27:57 [ERR] consul.session: Invalidation failed: leadership lost while committing log
    2017/04/28 13:27:57 [ERR] consul.session: Invalidation failed: leadership lost while committing log
    2017/04/28 13:27:57 [ERR] consul.session: Invalidation failed: leadership lost while committing log
    2017/04/28 13:27:57 [ERR] consul.session: Invalidation failed: leadership lost while committing log
    2017/04/28 13:27:57 [ERR] consul: failed to wait for barrier: leadership lost while committing log
panic: close of closed channel

goroutine 4090476 [running]:
github.com/hashicorp/consul/consul.(*Server).stopAutopilot(0xc420222f00)
        /gopath/src/github.com/hashicorp/consul/consul/autopilot.go:33 +0x2f
github.com/hashicorp/consul/consul.(*Server).revokeLeadership(0xc420222f00, 0xc657305ef8, 0xb0)
        /gopath/src/github.com/hashicorp/consul/consul/leader.go:189 +0x106
github.com/hashicorp/consul/consul.(*Server).leaderLoop(0xc420222f00, 0xc7f92f84e0)
        /gopath/src/github.com/hashicorp/consul/consul/leader.go:121 +0x50f
github.com/hashicorp/consul/consul.(*Server).monitorLeadership.func1(0xc420222f00, 0xc42000c360, 0xc8e89abb30)
        /gopath/src/github.com/hashicorp/consul/consul/leader.go:48 +0x38
created by github.com/hashicorp/consul/consul.(*Server).monitorLeadership
        /gopath/src/github.com/hashicorp/consul/consul/leader.go:50 +0x23b
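For reference, the Go runtime panics whenever close is called on a channel that has already been closed. The snippet below is illustrative only, not Consul code; it just reproduces the panic message shown in the trace above.

// Illustrative only: a second close on the same channel is what produces
// the "close of closed channel" panic above.
package main

func main() {
	stopCh := make(chan struct{})
	close(stopCh) // first close signals shutdown to anyone selecting on stopCh
	close(stopCh) // second close: panic: close of closed channel
}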
slackpad added the type/bug and type/crash labels on Apr 28, 2017

slackpad commented Apr 28, 2017

@Gufran thanks for opening an issue - we must be getting multiple or unordered stop events, so we will need to revisit the assumptions around the previous fix.
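One general way to tolerate duplicate or unordered stop events is to guard the channel close with sync.Once. The sketch below shows only that generic pattern, with hypothetical names; it is not necessarily the fix that was eventually applied in Consul.

package main

import (
	"fmt"
	"sync"
)

// autopilotStopper is a hypothetical sketch of an idempotent stop signal
// using sync.Once; it shows the general pattern, not Consul's actual code.
type autopilotStopper struct {
	stopCh   chan struct{}
	stopOnce sync.Once
}

func newAutopilotStopper() *autopilotStopper {
	return &autopilotStopper{stopCh: make(chan struct{})}
}

// stop can be called any number of times, from any shutdown path; the
// channel is closed exactly once, so duplicate stop events cannot panic.
func (a *autopilotStopper) stop() {
	a.stopOnce.Do(func() { close(a.stopCh) })
}

func main() {
	s := newAutopilotStopper()
	s.stop()
	s.stop() // duplicate stop event: a no-op instead of a panic
	fmt.Println("stopped cleanly")
}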

slackpad self-assigned this on Apr 28, 2017
slackpad commented

@magiconair noted that https://github.com/hashicorp/consul/blob/v0.8.1/consul/snapshot_endpoint.go#L103-L108 looks really suspicious because it performs the revoke-leadership actions out of step with the main leader loop.

@Gufran do you know if you had perhaps done a snapshot restore previously on the cluster where you saw this panic?

magiconair self-assigned this on May 1, 2017

Gufran commented May 2, 2017

@slackpad No snapshot was restored; however, we scaled the cluster up from 3 to 5 servers about a week ago and then upgraded from 0.7.1 to 0.8.1.


slackpad commented May 4, 2017

The snapshot path should be cleaned up as well, but I got a local test to reproduce this. It looks like you can get into leaderLoop() but never get a chance to run establishLeadership(), so this happens when the deferred revokeLeadership() runs. I'll trace through this a little more and see how to clean it up.
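To make the ordering concrete, here is a minimal hypothetical sketch of that failure mode: a deferred revoke step can fire even when the matching establish step never ran, so the second term re-closes the channel that the first term already closed. The names only mirror the stack trace and are not the real Consul implementation, and the autopilotRunning guard is just one possible way to avoid the double close.

package main

import "fmt"

// Hypothetical sketch of the ordering problem; not the real Consul code.
type server struct {
	autopilotShutdownCh chan struct{}
	autopilotRunning    bool
}

func (s *server) establishLeadership() {
	// Starting autopilot creates a fresh shutdown channel for this term.
	s.autopilotShutdownCh = make(chan struct{})
	s.autopilotRunning = true
}

func (s *server) revokeLeadership() {
	// Without this guard, a term in which establishLeadership never ran would
	// close the channel left over (and already closed) from the previous term.
	if !s.autopilotRunning {
		return
	}
	close(s.autopilotShutdownCh)
	s.autopilotRunning = false
}

func (s *server) leaderLoop(establishSucceeds bool) {
	defer s.revokeLeadership() // always runs, even if leadership is lost early
	if establishSucceeds {
		s.establishLeadership()
	}
	// barrier wait, reconcile loop, etc. elided
}

func main() {
	s := &server{autopilotShutdownCh: make(chan struct{})}
	s.leaderLoop(true)  // normal term: autopilot started, then stopped
	s.leaderLoop(false) // failed term: revoke runs without a matching establish
	fmt.Println("no panic")
}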


slackpad commented May 4, 2017

For the repro I was trying to test #3003: create a 3-server cluster using -raft-protocol=2, stop all servers without leave, then start them up again with -raft-protocol=3. Since they are not properly using their IDs in the configuration, nodes will win the election but then step down because they don't think they have a vote, which hits the panic, with the same "failed to wait for barrier" error right before.


magiconair commented May 4, 2017

I can reproduce the panic with 0.8.1 after a pkill -9f on all three nodes and a restart with -raft-protocol=3. It doesn't happen with raft protocol version 2.

    2017/05/04 11:15:10 [ERR] consul: failed to wait for barrier: node is not the leader
panic: close of closed channel

goroutine 133 [running]:
github.com/hashicorp/consul/consul.(*Server).stopAutopilot(0xc420286a00)
	/gopath/src/github.com/hashicorp/consul/consul/autopilot.go:33 +0x2f
github.com/hashicorp/consul/consul.(*Server).revokeLeadership(0xc420286a00, 0x7aee2e, 0x1dcd6500)
	/gopath/src/github.com/hashicorp/consul/consul/leader.go:189 +0x106
github.com/hashicorp/consul/consul.(*Server).leaderLoop(0xc420286a00, 0xc42001b560)
	/gopath/src/github.com/hashicorp/consul/consul/leader.go:121 +0x50f
github.com/hashicorp/consul/consul.(*Server).monitorLeadership.func1(0xc420286a00, 0xc42000cd38, 0xc420337180)
	/gopath/src/github.com/hashicorp/consul/consul/leader.go:48 +0x38
created by github.com/hashicorp/consul/consul.(*Server).monitorLeadership
	/gopath/src/github.com/hashicorp/consul/consul/leader.go:50 +0x23b
