Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KV tombstone GC can deadlock and stall all writes to a cluster #3700

Closed
slackpad opened this issue Nov 19, 2017 · 4 comments
Closed

KV tombstone GC can deadlock and stall all writes to a cluster #3700

slackpad opened this issue Nov 19, 2017 · 4 comments
Labels
type/bug Feature does not function as expected type/crash The issue description contains a golang panic and stack trace
Milestone

Comments

@slackpad
Copy link
Contributor

slackpad commented Nov 19, 2017

While testing Consul 1.0.1 pre-release we discovered a situation where the Consul agent would hang when resolving /v1/agent/self. In this case it was Nomad trying to hit it periodically that triggered the behavior.

We found the following stuck handler:

goroutine profile: total 1272
1131 @ 0x42d6ec 0x42d7de 0x43ed74 0x43e999 0x473b79 0xa287a1 0xf2a44f 0xf2f72d 0xf78b28 0xf72c65 0xf71685 0x684dc4 0x686580 0x687af4 0x683c9d 0x45ce61
#	0x43e998	sync.runtime_Semacquire+0x38						/goroot/src/runtime/sema.go:56
#	0x473b78	sync.(*RWMutex).RLock+0x48						/goroot/src/sync/rwmutex.go:50
#	0xa287a0	github.com/hashicorp/consul/agent/local.(*State).Stats+0x50		/gopath/src/github.com/hashicorp/consul/agent/local/state.go:567
#	0xf2a44e	github.com/hashicorp/consul/agent.(*Agent).Stats+0x21e			/gopath/src/github.com/hashicorp/consul/agent/agent.go:2049
#	0xf2f72c	github.com/hashicorp/consul/agent.(*HTTPServer).AgentSelf+0x3ac		/gopath/src/github.com/hashicorp/consul/agent/agent_endpoint.go:75
#	0xf78b27	github.com/hashicorp/consul/agent.(*HTTPServer).AgentSelf-fm+0x47	/gopath/src/github.com/hashicorp/consul/agent/http.go:101
#	0xf72c64	github.com/hashicorp/consul/agent.(*HTTPServer).wrap.func1+0x664	/gopath/src/github.com/hashicorp/consul/agent/http.go:272
#	0xf71684	github.com/hashicorp/consul/agent.(*HTTPServer).handler.func1.1+0x94	/gopath/src/github.com/hashicorp/consul/agent/http.go:69
#	0x684dc3	net/http.HandlerFunc.ServeHTTP+0x43					/goroot/src/net/http/server.go:1918
#	0x68657f	net/http.(*ServeMux).ServeHTTP+0x12f					/goroot/src/net/http/server.go:2254
#	0x687af3	net/http.serverHandler.ServeHTTP+0xb3					/goroot/src/net/http/server.go:2619
#	0x683c9c	net/http.(*conn).serve+0x71c						/goroot/src/net/http/server.go:1801

And it looks like there's a check update stuck here:

1 @ 0x42d6ec 0x42d7de 0x43ed74 0x43ea8d 0x472c5e 0x473c5d 0xa271bf 0xa205d5 0xa1f9ca 0x45ce61
#	0x43ea8c	sync.runtime_SemacquireMutex+0x3c					/goroot/src/runtime/sema.go:71
#	0x472c5d	sync.(*Mutex).Lock+0xed							/goroot/src/sync/mutex.go:134
#	0x473c5c	sync.(*RWMutex).Lock+0x2c						/goroot/src/sync/rwmutex.go:93
#	0xa271be	github.com/hashicorp/consul/agent/local.(*State).UpdateCheck+0x4e	/gopath/src/github.com/hashicorp/consul/agent/local/state.go:386
#	0xa205d4	github.com/hashicorp/consul/agent/checks.(*CheckHTTP).check+0xb04	/gopath/src/github.com/hashicorp/consul/agent/checks/check.go:420
#	0xa1f9c9	github.com/hashicorp/consul/agent/checks.(*CheckHTTP).run+0x69		/gopath/src/github.com/hashicorp/consul/agent/checks/check.go:360

And a reap stuck here:

1 @ 0x42d6ec 0x42d7de 0x43ed74 0x43e999 0x473b79 0xa2801b 0xf243f1 0xf24742 0x45ce61
#	0x43e998	sync.runtime_Semacquire+0x38							/goroot/src/runtime/sema.go:56
#	0x473b78	sync.(*RWMutex).RLock+0x48							/goroot/src/sync/rwmutex.go:50
#	0xa2801a	github.com/hashicorp/consul/agent/local.(*State).CriticalCheckStates+0x5a	/gopath/src/github.com/hashicorp/consul/agent/local/state.go:519
#	0xf243f0	github.com/hashicorp/consul/agent.(*Agent).reapServicesInternal+0xd0		/gopath/src/github.com/hashicorp/consul/agent/agent.go:1391
#	0xf24741	github.com/hashicorp/consul/agent.(*Agent).reapServices+0x41			/gopath/src/github.com/hashicorp/consul/agent/agent.go:1428

And an AE run stuck here:


1 @ 0x42d6ec 0x43db99 0x91ec84 0x5384aa 0x538ee9 0x7ca91a 0x7db285 0x7cdbd9 0x7cdb5b 0x979cf7 0x979951 0x978ded 0x97c17f 0x9df9f8 0x9ba023 0x45a6eb 0x4bac45 0x4ba224 0x7c61f2 0x7c73ec 0x9e9194 0xa2b91a 0xa29c74 0xa2992c 0xa17440 0xa1834e 0xa16f65 0xa16ee4 0x45ce61
#	0x91ec83	github.com/hashicorp/consul/vendor/github.com/hashicorp/yamux.(*Stream).Read+0x2b3					/gopath/src/github.com/hashicorp/consul/vendor/github.com/hashicorp/yamux/stream.go:130
#	0x5384a9	bufio.(*Reader).fill+0x119												/goroot/src/bufio/bufio.go:97
#	0x538ee8	bufio.(*Reader).ReadByte+0x38												/goroot/src/bufio/bufio.go:239
#	0x7ca919	github.com/hashicorp/consul/vendor/github.com/hashicorp/go-msgpack/codec.(*ioDecReader).readn1+0x79			/gopath/src/github.com/hashicorp/consul/vendor/github.com/hashicorp/go-msgpack/codec/decode.go:90
#	0x7db284	github.com/hashicorp/consul/vendor/github.com/hashicorp/go-msgpack/codec.(*msgpackDecDriver).initReadNext+0x44		/gopath/src/github.com/hashicorp/consul/vendor/github.com/hashicorp/go-msgpack/codec/msgpack.go:540
#	0x7cdbd8	github.com/hashicorp/consul/vendor/github.com/hashicorp/go-msgpack/codec.(*Decoder).decode+0x48				/gopath/src/github.com/hashicorp/consul/vendor/github.com/hashicorp/go-msgpack/codec/decode.go:635
#	0x7cdb5a	github.com/hashicorp/consul/vendor/github.com/hashicorp/go-msgpack/codec.(*Decoder).Decode+0x7a				/gopath/src/github.com/hashicorp/consul/vendor/github.com/hashicorp/go-msgpack/codec/decode.go:630
#	0x979cf6	github.com/hashicorp/consul/vendor/github.com/hashicorp/net-rpc-msgpackrpc.(*MsgpackCodec).read+0x56			/gopath/src/github.com/hashicorp/consul/vendor/github.com/hashicorp/net-rpc-msgpackrpc/codec.go:121
#	0x979950	github.com/hashicorp/consul/vendor/github.com/hashicorp/net-rpc-msgpackrpc.(*MsgpackCodec).ReadResponseHeader+0x40	/gopath/src/github.com/hashicorp/consul/vendor/github.com/hashicorp/net-rpc-msgpackrpc/codec.go:74
#	0x978dec	github.com/hashicorp/consul/vendor/github.com/hashicorp/net-rpc-msgpackrpc.CallWithCodec+0x10c				/gopath/src/github.com/hashicorp/consul/vendor/github.com/hashicorp/net-rpc-msgpackrpc/client.go:29
#	0x97c17e	github.com/hashicorp/consul/agent/pool.(*ConnPool).RPC+0x1ee								/gopath/src/github.com/hashicorp/consul/agent/pool/pool.go:414
#	0x9df9f7	github.com/hashicorp/consul/agent/consul.(*Server).forward+0x597							/gopath/src/github.com/hashicorp/consul/agent/consul/rpc.go:234
#	0x9ba022	github.com/hashicorp/consul/agent/consul.(*Catalog).Register+0xb2							/gopath/src/github.com/hashicorp/consul/agent/consul/catalog_endpoint.go:24
#	0x45a6ea	runtime.call64+0x3a													/goroot/src/runtime/asm_amd64.s:510
#	0x4bac44	reflect.Value.call+0x904												/goroot/src/reflect/value.go:434
#	0x4ba223	reflect.Value.Call+0xa3													/goroot/src/reflect/value.go:302
#	0x7c61f1	net/rpc.(*service).call+0x141												/goroot/src/net/rpc/server.go:381
#	0x7c73eb	net/rpc.(*Server).ServeRequest+0x22b											/goroot/src/net/rpc/server.go:496
#	0x9e9193	github.com/hashicorp/consul/agent/consul.(*Server).RPC+0xc3								/gopath/src/github.com/hashicorp/consul/agent/consul/server.go:983
#	0xa2b919	github.com/hashicorp/consul/agent/local.(*State).syncCheck+0x279							/gopath/src/github.com/hashicorp/consul/agent/local/state.go:982
#	0xa29c73	github.com/hashicorp/consul/agent/local.(*State).SyncChanges+0x313							/gopath/src/github.com/hashicorp/consul/agent/local/state.go:802
#	0xa2992b	github.com/hashicorp/consul/agent/local.(*State).SyncFull+0x5b								/gopath/src/github.com/hashicorp/consul/agent/local/state.go:760
#	0xa1743f	github.com/hashicorp/consul/agent/ae.(*StateSyncer).nextFSMState+0x4af							/gopath/src/github.com/hashicorp/consul/agent/ae/ae.go:167
#	0xa1834d	github.com/hashicorp/consul/agent/ae.(*StateSyncer).(github.com/hashicorp/consul/agent/ae.nextFSMState)-fm+0x3d		/gopath/src/github.com/hashicorp/consul/agent/ae/ae.go:147
#	0xa16f64	github.com/hashicorp/consul/agent/ae.(*StateSyncer).runFSM+0x44								/gopath/src/github.com/hashicorp/consul/agent/ae/ae.go:153
#	0xa16ee3	github.com/hashicorp/consul/agent/ae.(*StateSyncer).Run+0x63								/gopath/src/github.com/hashicorp/consul/agent/ae/ae.go:147

Here's the full gist of the stack traces for the agent - https://gist.github.com/slackpad/0e2e55d5d4656f82b458ef719ffcd6c1.

@slackpad slackpad added type/bug Feature does not function as expected type/crash The issue description contains a golang panic and stack trace labels Nov 19, 2017
@slackpad slackpad added this to the 1.0.1 milestone Nov 19, 2017
@slackpad
Copy link
Contributor Author

The AE run stuck on I/O looks like it's holding the lock and causing the other goroutines to block, so this current trace seems like a symptom of running out of FDs and not necessarily the cause.

@slackpad
Copy link
Contributor Author

Ugh - this is a nasty one. Running another test with instrumentation on each server (periodic lsof and goroutine dumps) we realized that this is actually a backup on the leader. The consul-soak load includes a lot of KV activity and here's what looks to be happening:

The leader loop is waiting to commit the barrier, so it's not servicing the tombstone GC channel:

1 @ 0x42d6ec 0x42d7de 0x404ad4 0x40477b 0x7ec565 0x9cf84a 0x9fa06b 0x45ce61
#       0x7ec564        github.com/hashicorp/consul/vendor/github.com/hashicorp/raft.(*deferError).Error+0x74   /gopath/src/github.com/hashicorp/consul/vendor/github.com/hashicorp/raft/future.go:100
#       0x9cf849        github.com/hashicorp/consul/agent/consul.(*Server).leaderLoop+0x6b9                     /gopath/src/github.com/hashicorp/consul/agent/consul/leader.go:112
#       0x9fa06a        github.com/hashicorp/consul/agent/consul.(*Server).monitorLeadership.func1+0x5a         /gopath/src/github.com/hashicorp/consul/agent/consul/leader.go:52

Meanwhile a GC timer fired and holds the tombstone GC lock, and is waiting to queue something into the channel above:

1 @ 0x42d6ec 0x42d7de 0x403cab 0x403a33 0x96dd0a 0x96fc15 0x45ce61
#       0x96dd09        github.com/hashicorp/consul/agent/consul/state.(*TombstoneGC).expireTime+0x139  /gopath/src/github.com/hashicorp/consul/agent/consul/state/tombstone_gc.go:158
#       0x96fc14        github.com/hashicorp/consul/agent/consul/state.(*TombstoneGC).Hint.func1+0x44   /gopath/src/github.com/hashicorp/consul/agent/consul/state/tombstone_gc.go:121

But there's a KVS-related FSM operation running that's stuck on the tombstone GC lock:

1 @ 0x42d6ec 0x42d7de 0x43ed74 0x43ea8d 0x472c5e 0x96d6ca 0x96fa26 0x912ed8 0x95fbf1 0x9c527e 0x9c3a2d 0x804bfb 0x7ec41e 0x806d7a 0x8064c3 0x45ce61
#       0x43ea8c        sync.runtime_SemacquireMutex+0x3c                                                                                                                       /goroot/src/runtime/sema.go:71
#       0x472c5d        sync.(*Mutex).Lock+0xed                                                                                                                                 /goroot/src/sync/mutex.go:134
#       0x96d6c9        github.com/hashicorp/consul/agent/consul/state.(*TombstoneGC).Hint+0x79                                                                                 /gopath/src/github.com/hashicorp/consul/agent/consul/state/tombstone_gc.go:101
#       0x96fa25        github.com/hashicorp/consul/agent/consul/state.(*Graveyard).InsertTxn.func1+0x35                                                                        /gopath/src/github.com/hashicorp/consul/agent/consul/state/graveyard.go:41
#       0x912ed7        github.com/hashicorp/consul/vendor/github.com/hashicorp/go-memdb.(*Txn).Commit+0x3d7                                                                    /gopath/src/github.com/hashicorp/consul/vendor/github.com/hashicorp/go-memdb/txn.go:147
#       0x95fbf0        github.com/hashicorp/consul/agent/consul/state.(*Store).KVSDeleteTree+0xe0                                                                              /gopath/src/github.com/hashicorp/consul/agent/consul/state/kvs.go:416
#       0x9c527d        github.com/hashicorp/consul/agent/consul.(*consulFSM).applyKVSOperation+0x93d                                                                           /gopath/src/github.com/hashicorp/consul/agent/consul/fsm.go:193
#       0x9c3a2c        github.com/hashicorp/consul/agent/consul.(*consulFSM).Apply+0x24c                                                                                       /gopath/src/github.com/hashicorp/consul/agent/consul/fsm.go:102
#       0x804bfa        github.com/hashicorp/consul/vendor/github.com/hashicorp/raft.(*Raft).runFSM.func1+0x17a                                                                 /gopath/src/github.com/hashicorp/consul/vendor/github.com/hashicorp/raft/fsm.go:57
#       0x7ec41d        github.com/hashicorp/consul/vendor/github.com/hashicorp/raft.(*Raft).runFSM+0x31d                                                                       /gopath/src/github.com/hashicorp/consul/vendor/github.com/hashicorp/raft/fsm.go:120
#       0x806d79        github.com/hashicorp/consul/vendor/github.com/hashicorp/raft.(*Raft).(github.com/hashicorp/consul/vendor/github.com/hashicorp/raft.runFSM)-fm+0x29      /gopath/src/github.com/hashicorp/consul/vendor/github.com/hashicorp/raft/api.go:506
#       0x8064c2        github.com/hashicorp/consul/vendor/github.com/hashicorp/raft.(*raftState).goFunc.func1+0x52                                                             /gopath/src/github.com/hashicorp/consul/vendor/github.com/hashicorp/raft/state.go:146

And that's keeping the barrier from going through, so we have a deadlock. It seems like the simplest fix here would be to prevent the Hint() call from ever blocking since this is a low-consequence operation (it's ok to leave tombstones around for longer). We can add a little buffering to the expire channel to help avoid this during busy times.

@slackpad
Copy link
Contributor Author

So the /v1/agent/self hang is because Raft was tied up with the issue above - it was a symptom but not the cause.

slackpad added a commit that referenced this issue Nov 20, 2017
The lock isn't needed after we clean up the expire bin, and as seen
in #3700 we can get into a deadlock waiting to place the expire index
into the channel while holding this lock.

Fixes #3700
@slackpad slackpad changed the title Hang in /v1/agent/self KV tombstone GC can deadlock and stall all writes to a cluster Nov 21, 2017
@nh2
Copy link

nh2 commented Feb 13, 2018

For anybody like me wondering how to obtain the goroutines stack trace dump as in the issue description, set "enable_debug": true in your consul config and go to http://localhost:8500/debug/pprof/goroutine?debug=1.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/bug Feature does not function as expected type/crash The issue description contains a golang panic and stack trace
Projects
None yet
Development

No branches or pull requests

2 participants