
Panic in tombstone_gc.go #2087

Closed
shomron opened this issue Jun 2, 2016 · 6 comments · Fixed by #2973
Labels: type/bug (Feature does not function as expected), type/crash (The issue description contains a golang panic and stack trace)

Comments

shomron commented Jun 2, 2016

When filing a bug, please include the following:

consul version for both Client and Server

Client: Consul v0.6.4
Server: Consul v0.6.4

consul info for both Client and Server

Client:

Vault 0.5.2 (via its internal Consul client).

Server:

agent:
    check_monitors = 0
    check_ttls = 1
    checks = 1
    services = 2
build:
    prerelease =
    revision = 26a0ef8c
    version = 0.6.4
consul:
    bootstrap = false
    known_datacenters = 1
    leader = true
    server = true
raft:
    applied_index = 2774
    commit_index = 2774
    fsm_pending = 0
    last_contact = 0
    last_log_index = 2774
    last_log_term = 17
    last_snapshot_index = 0
    last_snapshot_term = 0
    num_peers = 3
    state = Leader
    term = 17
runtime:
    arch = amd64
    cpu_count = 4
    goroutines = 80
    max_procs = 4
    os = linux
    version = go1.6
serf_lan:
    encrypted = true
    event_queue = 0
    event_time = 12
    failed = 0
    intent_queue = 0
    left = 0
    member_time = 2
    members = 3
    query_queue = 0
    query_time = 1
serf_wan:
    encrypted = true
    event_queue = 0
    event_time = 1
    failed = 0
    intent_queue = 0
    left = 0
    member_time = 1
    members = 1
    query_queue = 0
    query_time = 1

Operating system and Environment details

OS X 10.11.5 (15F34)
Docker Version 1.11.1-beta13.1 (build: 8193)
Image: library/consul:v0.6.4 7c41a9d97d38

Description of the Issue (and unexpected/desired result)

Panic on one node out of a 3-node cluster. All nodes are running locally in Docker using the official image. The networking errors may be due to the host machine sleeping at one point.

Reproduction steps

None at this time.

Log Fragments or Link to gist

consul0    |     2016/06/02 17:38:47 [WARN] raft: Failed to contact quorum of nodes, stepping down
consul0    |     2016/06/02 17:38:47 [INFO] raft: Node at 172.18.0.2:8300 [Follower] entering Follower state
consul0    |     2016/06/02 17:38:47 [INFO] consul: cluster leadership lost
consul0    |     2016/06/02 17:38:47 [WARN] agent: Check 'vault-sealed-check' missed TTL, is now critical
consul0    |     2016/06/02 17:38:47 [ERR] agent: failed to sync changes: No cluster leader
consul0    |     2016/06/02 17:38:47 [ERR] agent: failed to sync changes: No cluster leader
consul0    |     2016/06/02 17:38:47 [INFO] raft: aborting pipeline replication to peer 172.18.0.4:8300
consul0    |     2016/06/02 17:38:47 [ERR] agent: coordinate update error: No cluster leader
consul0    |     2016/06/02 17:38:47 [INFO] raft: aborting pipeline replication to peer 172.18.0.3:8300
consul0    |     2016/06/02 17:38:47 [ERR] http: Request GET /v1/kv/vault/core/lock?index=1274&token=<hidden>&wait=15000ms, error: No cluster leader from=172.18.0.6:58450
consul0    |     2016/06/02 17:38:47 [ERR] http: Request PUT /v1/session/destroy/2ac34feb-989f-bc05-ec93-618106445b19?token=<hidden>, error: No cluster leader from=172.18.0.6:58456
consul0    | panic: runtime error: invalid memory address or nil pointer dereference
consul0    | [signal 0xb code=0x1 addr=0x0 pc=0x73a459]
consul0    |
consul0    | goroutine 944 [running]:
consul0    | panic(0xcb9ec0, 0xc820010070)
consul0    |     /goroot/src/runtime/panic.go:464 +0x3e6
consul0    | github.com/hashicorp/consul/consul/state.(*TombstoneGC).expireTime(0xc8201eda10, 0xecee26616, 0x0, 0x12cce40)
consul0    |     /gopath/src/github.com/hashicorp/consul/consul/state/tombstone_gc.go:149 +0x119
consul0    | github.com/hashicorp/consul/consul/state.(*TombstoneGC).Hint.func1()
consul0    |     /gopath/src/github.com/hashicorp/consul/consul/state/tombstone_gc.go:120 +0x3f
consul0    | created by time.goFunc
consul0    |     /goroot/src/time/sleep.go:129 +0x3a
slackpad added the type/bug and type/crash labels on Jun 2, 2016
slackpad added this to the 0.7.2 milestone on Nov 22, 2016
slackpad modified the milestones: 0.7.3, 0.7.2 on Dec 15, 2016
slackpad (Contributor) commented

Going to kick this one forward; I think the PR might still have some race condition potential, but I need to look at this in a little more detail.

slackpad modified the milestones: 0.7.4, 0.7.3 on Jan 17, 2017
slackpad removed this from the Triaged milestone on Apr 18, 2017
deckarep (Contributor) commented Apr 25, 2017

@slackpad - I believe the race condition is here:

Line 116 of consul/state/tombstone_gc.go

	// Create new expiration time
	t.expires[expires] = &expireInterval{
		maxIndex: index,
		timer: time.AfterFunc(expires.Sub(time.Now()), func() {
			t.expireTime(expires)
		}),
	}

In the above code there is a happens-before race condition: the time.AfterFunc callback can fire before the expireInterval entry has actually been added to the t.expires map. I was able to reproduce it by artificially shortening the duration passed to time.AfterFunc to near zero, which is a possible argument value.

Should the race condition occur, you end up with a panic because no expireInterval is found in the map within the expireTime method, and the nil lookup result is dereferenced.
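
To make the failure mode concrete, here is a minimal, self-contained sketch (hypothetical names; not the actual Consul code) of the pattern: a time.AfterFunc callback that dereferences an unchecked map lookup panics with exactly this nil pointer dereference whenever the entry is absent at fire time.

package main

import (
	"fmt"
	"sync"
	"time"
)

type expireInterval struct {
	maxIndex uint64
}

type gc struct {
	lock    sync.Mutex
	expires map[time.Time]*expireInterval
}

// expireTime mimics the pre-fix method: the lookup result is
// dereferenced without checking that the entry actually exists.
func (g *gc) expireTime(expires time.Time) {
	g.lock.Lock()
	defer g.lock.Unlock()
	exp := g.expires[expires] // nil when the key is missing
	delete(g.expires, expires)
	fmt.Println(exp.maxIndex) // panics: nil pointer dereference
}

func main() {
	g := &gc{expires: make(map[time.Time]*expireInterval)}
	expires := time.Now()

	g.lock.Lock()
	// A near-zero duration makes the timer fire immediately, but its
	// callback blocks trying to acquire g.lock.
	time.AfterFunc(0, func() { g.expireTime(expires) })
	// The entry is never added (or has been removed), so once the
	// lock is released the callback sees a missing key.
	g.lock.Unlock()

	time.Sleep(100 * time.Millisecond) // let the callback run and panic
}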

A fix could be to simply make that case a no-op, as shown in the comment below.

deckarep (Contributor) commented Apr 25, 2017

expireTime method:

// expireTime is used to expire the entries at the given time
func (t *TombstoneGC) expireTime(expires time.Time) {
	// Get the maximum index and clear the entry
	t.lock.Lock()
	defer t.lock.Unlock()
	exp, ok := t.expires[expires]
	if !ok {
		// The entry was already removed (e.g. the map was cleared
		// while this timer was pending), so there is nothing to do.
		return
	}

	delete(t.expires, expires)

	// Notify the expires channel
	t.expireCh <- exp.maxIndex
}

slackpad (Contributor) commented

Thanks for taking a look, @deckarep. I think Hint() properly holds the lock while it's adding to t.expires, though, so even if the timer fires immediately, expireTime() has to wait to read the map since it acquires the same lock.
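
For reference, the locking around that insert looks roughly like this (a condensed paraphrase of Hint() in v0.6.4 with the existing-entry handling elided; not the verbatim source):

// Hint is used to indicate that keys were deleted at the given index
// (paraphrased; only the locking structure matters here).
func (t *TombstoneGC) Hint(index uint64) {
	expires := t.nextExpires()

	t.lock.Lock()
	defer t.lock.Unlock()

	// ... update of an already-tracked expireInterval elided ...

	// The lock is held across this insert, so a timer that fires
	// immediately blocks inside expireTime() until Hint() returns,
	// by which point the map entry is in place.
	t.expires[expires] = &expireInterval{
		maxIndex: index,
		timer: time.AfterFunc(expires.Sub(time.Now()), func() {
			t.expireTime(expires)
		}),
	}
}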

slackpad (Contributor) commented

I think the race is in here - https://github.com/hashicorp/consul/blob/v0.6.4/consul/state/tombstone_gc.go#L83-L89.

The timer fires before Stop() is called but its callback is blocked on the lock; when expireTime() eventually runs, the map lookup fails because the map has already been cleared out. The fix in #2087 (comment) is a good one since it makes it always safe for a timer to fire. I'll add a PR for that.
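
In outline, the racy sequence looks like this (a simplified paraphrase of the linked v0.6.4 lines; not the verbatim source):

// Reset stops all pending timers and clears the tracked entries.
func (t *TombstoneGC) Reset() {
	t.lock.Lock()
	defer t.lock.Unlock()
	for _, exp := range t.expires {
		// Stop() cannot help once a timer has already fired; that
		// timer's callback may be blocked on t.lock right now.
		exp.timer.Stop()
	}
	t.expires = make(map[time.Time]*expireInterval)
}

// Sequence: a timer fires and its callback blocks on t.lock; Reset()
// acquires the lock, stops the (already fired) timer, and clears the
// map; the callback then enters expireTime(), the lookup misses, and
// the pre-fix code dereferences nil. With the !ok check above, the
// callback simply returns instead.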

deckarep (Contributor) commented

@slackpad - ah, that does make sense. Great analysis on this one... nailing down a solution to these bugs is always tricky. 👍
