KV tombstone GC can deadlock and stall all writes to a cluster #3700

slackpad · 2017-11-19T03:55:03Z

While testing Consul 1.0.1 pre-release we discovered a situation where the Consul agent would hang when resolving /v1/agent/self. In this case it was Nomad trying to hit it periodically that triggered the behavior.

We found the following stuck handler:

goroutine profile: total 1272
1131 @ 0x42d6ec 0x42d7de 0x43ed74 0x43e999 0x473b79 0xa287a1 0xf2a44f 0xf2f72d 0xf78b28 0xf72c65 0xf71685 0x684dc4 0x686580 0x687af4 0x683c9d 0x45ce61
#	0x43e998	sync.runtime_Semacquire+0x38						/goroot/src/runtime/sema.go:56
#	0x473b78	sync.(*RWMutex).RLock+0x48						/goroot/src/sync/rwmutex.go:50
#	0xa287a0	github.com/hashicorp/consul/agent/local.(*State).Stats+0x50		/gopath/src/github.com/hashicorp/consul/agent/local/state.go:567
#	0xf2a44e	github.com/hashicorp/consul/agent.(*Agent).Stats+0x21e			/gopath/src/github.com/hashicorp/consul/agent/agent.go:2049
#	0xf2f72c	github.com/hashicorp/consul/agent.(*HTTPServer).AgentSelf+0x3ac		/gopath/src/github.com/hashicorp/consul/agent/agent_endpoint.go:75
#	0xf78b27	github.com/hashicorp/consul/agent.(*HTTPServer).AgentSelf-fm+0x47	/gopath/src/github.com/hashicorp/consul/agent/http.go:101
#	0xf72c64	github.com/hashicorp/consul/agent.(*HTTPServer).wrap.func1+0x664	/gopath/src/github.com/hashicorp/consul/agent/http.go:272
#	0xf71684	github.com/hashicorp/consul/agent.(*HTTPServer).handler.func1.1+0x94	/gopath/src/github.com/hashicorp/consul/agent/http.go:69
#	0x684dc3	net/http.HandlerFunc.ServeHTTP+0x43					/goroot/src/net/http/server.go:1918
#	0x68657f	net/http.(*ServeMux).ServeHTTP+0x12f					/goroot/src/net/http/server.go:2254
#	0x687af3	net/http.serverHandler.ServeHTTP+0xb3					/goroot/src/net/http/server.go:2619
#	0x683c9c	net/http.(*conn).serve+0x71c						/goroot/src/net/http/server.go:1801

And it looks like there's a check update stuck here:

1 @ 0x42d6ec 0x42d7de 0x43ed74 0x43ea8d 0x472c5e 0x473c5d 0xa271bf 0xa205d5 0xa1f9ca 0x45ce61
#	0x43ea8c	sync.runtime_SemacquireMutex+0x3c					/goroot/src/runtime/sema.go:71
#	0x472c5d	sync.(*Mutex).Lock+0xed							/goroot/src/sync/mutex.go:134
#	0x473c5c	sync.(*RWMutex).Lock+0x2c						/goroot/src/sync/rwmutex.go:93
#	0xa271be	github.com/hashicorp/consul/agent/local.(*State).UpdateCheck+0x4e	/gopath/src/github.com/hashicorp/consul/agent/local/state.go:386
#	0xa205d4	github.com/hashicorp/consul/agent/checks.(*CheckHTTP).check+0xb04	/gopath/src/github.com/hashicorp/consul/agent/checks/check.go:420
#	0xa1f9c9	github.com/hashicorp/consul/agent/checks.(*CheckHTTP).run+0x69		/gopath/src/github.com/hashicorp/consul/agent/checks/check.go:360

And a reap stuck here:

1 @ 0x42d6ec 0x42d7de 0x43ed74 0x43e999 0x473b79 0xa2801b 0xf243f1 0xf24742 0x45ce61
#	0x43e998	sync.runtime_Semacquire+0x38							/goroot/src/runtime/sema.go:56
#	0x473b78	sync.(*RWMutex).RLock+0x48							/goroot/src/sync/rwmutex.go:50
#	0xa2801a	github.com/hashicorp/consul/agent/local.(*State).CriticalCheckStates+0x5a	/gopath/src/github.com/hashicorp/consul/agent/local/state.go:519
#	0xf243f0	github.com/hashicorp/consul/agent.(*Agent).reapServicesInternal+0xd0		/gopath/src/github.com/hashicorp/consul/agent/agent.go:1391
#	0xf24741	github.com/hashicorp/consul/agent.(*Agent).reapServices+0x41			/gopath/src/github.com/hashicorp/consul/agent/agent.go:1428

And an AE run stuck here:


1 @ 0x42d6ec 0x43db99 0x91ec84 0x5384aa 0x538ee9 0x7ca91a 0x7db285 0x7cdbd9 0x7cdb5b 0x979cf7 0x979951 0x978ded 0x97c17f 0x9df9f8 0x9ba023 0x45a6eb 0x4bac45 0x4ba224 0x7c61f2 0x7c73ec 0x9e9194 0xa2b91a 0xa29c74 0xa2992c 0xa17440 0xa1834e 0xa16f65 0xa16ee4 0x45ce61
#	0x91ec83	github.com/hashicorp/consul/vendor/github.com/hashicorp/yamux.(*Stream).Read+0x2b3					/gopath/src/github.com/hashicorp/consul/vendor/github.com/hashicorp/yamux/stream.go:130
#	0x5384a9	bufio.(*Reader).fill+0x119												/goroot/src/bufio/bufio.go:97
#	0x538ee8	bufio.(*Reader).ReadByte+0x38												/goroot/src/bufio/bufio.go:239
#	0x7ca919	github.com/hashicorp/consul/vendor/github.com/hashicorp/go-msgpack/codec.(*ioDecReader).readn1+0x79			/gopath/src/github.com/hashicorp/consul/vendor/github.com/hashicorp/go-msgpack/codec/decode.go:90
#	0x7db284	github.com/hashicorp/consul/vendor/github.com/hashicorp/go-msgpack/codec.(*msgpackDecDriver).initReadNext+0x44		/gopath/src/github.com/hashicorp/consul/vendor/github.com/hashicorp/go-msgpack/codec/msgpack.go:540
#	0x7cdbd8	github.com/hashicorp/consul/vendor/github.com/hashicorp/go-msgpack/codec.(*Decoder).decode+0x48				/gopath/src/github.com/hashicorp/consul/vendor/github.com/hashicorp/go-msgpack/codec/decode.go:635
#	0x7cdb5a	github.com/hashicorp/consul/vendor/github.com/hashicorp/go-msgpack/codec.(*Decoder).Decode+0x7a				/gopath/src/github.com/hashicorp/consul/vendor/github.com/hashicorp/go-msgpack/codec/decode.go:630
#	0x979cf6	github.com/hashicorp/consul/vendor/github.com/hashicorp/net-rpc-msgpackrpc.(*MsgpackCodec).read+0x56			/gopath/src/github.com/hashicorp/consul/vendor/github.com/hashicorp/net-rpc-msgpackrpc/codec.go:121
#	0x979950	github.com/hashicorp/consul/vendor/github.com/hashicorp/net-rpc-msgpackrpc.(*MsgpackCodec).ReadResponseHeader+0x40	/gopath/src/github.com/hashicorp/consul/vendor/github.com/hashicorp/net-rpc-msgpackrpc/codec.go:74
#	0x978dec	github.com/hashicorp/consul/vendor/github.com/hashicorp/net-rpc-msgpackrpc.CallWithCodec+0x10c				/gopath/src/github.com/hashicorp/consul/vendor/github.com/hashicorp/net-rpc-msgpackrpc/client.go:29
#	0x97c17e	github.com/hashicorp/consul/agent/pool.(*ConnPool).RPC+0x1ee								/gopath/src/github.com/hashicorp/consul/agent/pool/pool.go:414
#	0x9df9f7	github.com/hashicorp/consul/agent/consul.(*Server).forward+0x597							/gopath/src/github.com/hashicorp/consul/agent/consul/rpc.go:234
#	0x9ba022	github.com/hashicorp/consul/agent/consul.(*Catalog).Register+0xb2							/gopath/src/github.com/hashicorp/consul/agent/consul/catalog_endpoint.go:24
#	0x45a6ea	runtime.call64+0x3a													/goroot/src/runtime/asm_amd64.s:510
#	0x4bac44	reflect.Value.call+0x904												/goroot/src/reflect/value.go:434
#	0x4ba223	reflect.Value.Call+0xa3													/goroot/src/reflect/value.go:302
#	0x7c61f1	net/rpc.(*service).call+0x141												/goroot/src/net/rpc/server.go:381
#	0x7c73eb	net/rpc.(*Server).ServeRequest+0x22b											/goroot/src/net/rpc/server.go:496
#	0x9e9193	github.com/hashicorp/consul/agent/consul.(*Server).RPC+0xc3								/gopath/src/github.com/hashicorp/consul/agent/consul/server.go:983
#	0xa2b919	github.com/hashicorp/consul/agent/local.(*State).syncCheck+0x279							/gopath/src/github.com/hashicorp/consul/agent/local/state.go:982
#	0xa29c73	github.com/hashicorp/consul/agent/local.(*State).SyncChanges+0x313							/gopath/src/github.com/hashicorp/consul/agent/local/state.go:802
#	0xa2992b	github.com/hashicorp/consul/agent/local.(*State).SyncFull+0x5b								/gopath/src/github.com/hashicorp/consul/agent/local/state.go:760
#	0xa1743f	github.com/hashicorp/consul/agent/ae.(*StateSyncer).nextFSMState+0x4af							/gopath/src/github.com/hashicorp/consul/agent/ae/ae.go:167
#	0xa1834d	github.com/hashicorp/consul/agent/ae.(*StateSyncer).(github.com/hashicorp/consul/agent/ae.nextFSMState)-fm+0x3d		/gopath/src/github.com/hashicorp/consul/agent/ae/ae.go:147
#	0xa16f64	github.com/hashicorp/consul/agent/ae.(*StateSyncer).runFSM+0x44								/gopath/src/github.com/hashicorp/consul/agent/ae/ae.go:153
#	0xa16ee3	github.com/hashicorp/consul/agent/ae.(*StateSyncer).Run+0x63								/gopath/src/github.com/hashicorp/consul/agent/ae/ae.go:147

Here's the full gist of the stack traces for the agent - https://gist.github.com/slackpad/0e2e55d5d4656f82b458ef719ffcd6c1.

The text was updated successfully, but these errors were encountered:

slackpad · 2017-11-19T04:17:27Z

The AE run stuck on I/O looks like it's holding the lock and causing the other goroutines to block, so this current trace seems like a symptom of running out of FDs and not necessarily the cause.

slackpad · 2017-11-19T23:52:57Z

Ugh - this is a nasty one. Running another test with instrumentation on each server (periodic lsof and goroutine dumps) we realized that this is actually a backup on the leader. The consul-soak load includes a lot of KV activity and here's what looks to be happening:

The leader loop is waiting to commit the barrier, so it's not servicing the tombstone GC channel:

1 @ 0x42d6ec 0x42d7de 0x404ad4 0x40477b 0x7ec565 0x9cf84a 0x9fa06b 0x45ce61
#       0x7ec564        github.com/hashicorp/consul/vendor/github.com/hashicorp/raft.(*deferError).Error+0x74   /gopath/src/github.com/hashicorp/consul/vendor/github.com/hashicorp/raft/future.go:100
#       0x9cf849        github.com/hashicorp/consul/agent/consul.(*Server).leaderLoop+0x6b9                     /gopath/src/github.com/hashicorp/consul/agent/consul/leader.go:112
#       0x9fa06a        github.com/hashicorp/consul/agent/consul.(*Server).monitorLeadership.func1+0x5a         /gopath/src/github.com/hashicorp/consul/agent/consul/leader.go:52

Meanwhile a GC timer fired and holds the tombstone GC lock, and is waiting to queue something into the channel above:

1 @ 0x42d6ec 0x42d7de 0x403cab 0x403a33 0x96dd0a 0x96fc15 0x45ce61
#       0x96dd09        github.com/hashicorp/consul/agent/consul/state.(*TombstoneGC).expireTime+0x139  /gopath/src/github.com/hashicorp/consul/agent/consul/state/tombstone_gc.go:158
#       0x96fc14        github.com/hashicorp/consul/agent/consul/state.(*TombstoneGC).Hint.func1+0x44   /gopath/src/github.com/hashicorp/consul/agent/consul/state/tombstone_gc.go:121

But there's a KVS-related FSM operation running that's stuck on the tombstone GC lock:

1 @ 0x42d6ec 0x42d7de 0x43ed74 0x43ea8d 0x472c5e 0x96d6ca 0x96fa26 0x912ed8 0x95fbf1 0x9c527e 0x9c3a2d 0x804bfb 0x7ec41e 0x806d7a 0x8064c3 0x45ce61
#       0x43ea8c        sync.runtime_SemacquireMutex+0x3c                                                                                                                       /goroot/src/runtime/sema.go:71
#       0x472c5d        sync.(*Mutex).Lock+0xed                                                                                                                                 /goroot/src/sync/mutex.go:134
#       0x96d6c9        github.com/hashicorp/consul/agent/consul/state.(*TombstoneGC).Hint+0x79                                                                                 /gopath/src/github.com/hashicorp/consul/agent/consul/state/tombstone_gc.go:101
#       0x96fa25        github.com/hashicorp/consul/agent/consul/state.(*Graveyard).InsertTxn.func1+0x35                                                                        /gopath/src/github.com/hashicorp/consul/agent/consul/state/graveyard.go:41
#       0x912ed7        github.com/hashicorp/consul/vendor/github.com/hashicorp/go-memdb.(*Txn).Commit+0x3d7                                                                    /gopath/src/github.com/hashicorp/consul/vendor/github.com/hashicorp/go-memdb/txn.go:147
#       0x95fbf0        github.com/hashicorp/consul/agent/consul/state.(*Store).KVSDeleteTree+0xe0                                                                              /gopath/src/github.com/hashicorp/consul/agent/consul/state/kvs.go:416
#       0x9c527d        github.com/hashicorp/consul/agent/consul.(*consulFSM).applyKVSOperation+0x93d                                                                           /gopath/src/github.com/hashicorp/consul/agent/consul/fsm.go:193
#       0x9c3a2c        github.com/hashicorp/consul/agent/consul.(*consulFSM).Apply+0x24c                                                                                       /gopath/src/github.com/hashicorp/consul/agent/consul/fsm.go:102
#       0x804bfa        github.com/hashicorp/consul/vendor/github.com/hashicorp/raft.(*Raft).runFSM.func1+0x17a                                                                 /gopath/src/github.com/hashicorp/consul/vendor/github.com/hashicorp/raft/fsm.go:57
#       0x7ec41d        github.com/hashicorp/consul/vendor/github.com/hashicorp/raft.(*Raft).runFSM+0x31d                                                                       /gopath/src/github.com/hashicorp/consul/vendor/github.com/hashicorp/raft/fsm.go:120
#       0x806d79        github.com/hashicorp/consul/vendor/github.com/hashicorp/raft.(*Raft).(github.com/hashicorp/consul/vendor/github.com/hashicorp/raft.runFSM)-fm+0x29      /gopath/src/github.com/hashicorp/consul/vendor/github.com/hashicorp/raft/api.go:506
#       0x8064c2        github.com/hashicorp/consul/vendor/github.com/hashicorp/raft.(*raftState).goFunc.func1+0x52                                                             /gopath/src/github.com/hashicorp/consul/vendor/github.com/hashicorp/raft/state.go:146

And that's keeping the barrier from going through, so we have a deadlock. It seems like the simplest fix here would be to prevent the Hint() call from ever blocking since this is a low-consequence operation (it's ok to leave tombstones around for longer). We can add a little buffering to the expire channel to help avoid this during busy times.

slackpad · 2017-11-19T23:53:34Z

So the /v1/agent/self hang is because Raft was tied up with the issue above - it was a symptom but not the cause.

The lock isn't needed after we clean up the expire bin, and as seen in #3700 we can get into a deadlock waiting to place the expire index into the channel while holding this lock. Fixes #3700

nh2 · 2018-02-13T13:15:40Z

For anybody like me wondering how to obtain the goroutines stack trace dump as in the issue description, set "enable_debug": true in your consul config and go to http://localhost:8500/debug/pprof/goroutine?debug=1.

slackpad added type/bug Feature does not function as expected type/crash The issue description contains a golang panic and stack trace labels Nov 19, 2017

slackpad added this to the 1.0.1 milestone Nov 19, 2017

slackpad mentioned this issue Nov 20, 2017

Gives back the lock before writing to the expire channel. #3701

Merged

slackpad closed this as completed in #3701 Nov 20, 2017

slackpad changed the title ~~Hang in /v1/agent/self~~ KV tombstone GC can deadlock and stall all writes to a cluster Nov 21, 2017

slackpad mentioned this issue Nov 22, 2017

Cluster becomes unresponsive and does not elect new leader after disk latency spike on leader #3552

Closed

slackpad mentioned this issue Nov 30, 2017

Consul server 1.0.0 stalled after some time. #3721

Closed

slackpad mentioned this issue Jan 9, 2018

Consul write api hangs for half hour before return rpc error #3738

Open

nh2 mentioned this issue Feb 13, 2018

consul doesn't expire proxy keys sorintlab/stolon#397

Closed

nh2 mentioned this issue Feb 26, 2018

consul: 0.9.3 -> 1.0.6 NixOS/nixpkgs#35602

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

KV tombstone GC can deadlock and stall all writes to a cluster #3700

KV tombstone GC can deadlock and stall all writes to a cluster #3700

slackpad commented Nov 19, 2017 •

edited

Loading

slackpad commented Nov 19, 2017

slackpad commented Nov 19, 2017

slackpad commented Nov 19, 2017

nh2 commented Feb 13, 2018 •

edited

Loading

KV tombstone GC can deadlock and stall all writes to a cluster #3700

KV tombstone GC can deadlock and stall all writes to a cluster #3700

Comments

slackpad commented Nov 19, 2017 • edited Loading

slackpad commented Nov 19, 2017

slackpad commented Nov 19, 2017

slackpad commented Nov 19, 2017

nh2 commented Feb 13, 2018 • edited Loading

slackpad commented Nov 19, 2017 •

edited

Loading

nh2 commented Feb 13, 2018 •

edited

Loading