This repository has been archived by the owner on Sep 22, 2020. It is now read-only.

distclient: fix deadlock between server and distclient #257

Merged
merged 1 commit into coreos:master on Jun 13, 2016

Conversation

sgotti
Member

@sgotti sgotti commented Jun 11, 2016

This patch makes the distClient mutex locking/unlocking more fine-grained, so that the distClient mutex is no longer held while calling server.[Get|Update]PeerMap.

This should fix the following deadlock: during some HA tests, while stopping/starting various torusd processes, I noticed that the other `torusd` and `torusblk` processes were all blocked. I got a stack trace of all the processes and found this possible deadlock (in both the `torusd` and `torusblk` processes):

```
goroutine 87 [semacquire, 53 minutes]:
sync.runtime_Semacquire(0xc820150234)
        /usr/local/go/src/runtime/sema.go:47 +0x26
sync.(*Mutex).Lock(0xc820150230)
        /usr/local/go/src/sync/mutex.go:83 +0x1c4
github.com/coreos/torus.(*Server).GetPeerMap(0xc820150200, 0x0)
        /home/sgotti/golang/src/github.com/coreos/torus/server.go:98 +0x4c
github.com/coreos/torus/distributor.(*distClient).getConn(0xc8202ab180, 0xc8202af4d0, 0x24, 0x0, 0x0)
        /home/sgotti/golang/src/github.com/coreos/torus/distributor/client.go:59 +0x146
github.com/coreos/torus/distributor.(*distClient).Check(0xc8202ab180, 0x7ff5c602aaa0, 0xc8217e4ea0, 0xc8202af4d0, 0x24, 0xc82164b800, 0x32, 0x40, 0x0, 0x0, ...)
        /home/sgotti/golang/src/github.com/coreos/torus/distributor/client.go:150 +0x76
github.com/coreos/torus/distributor/rebalance.(*rebalancer).Tick(0xc82030d4a0, 0xc820393020, 0x0, 0x0)
        /home/sgotti/golang/src/github.com/coreos/torus/distributor/rebalance/tick.go:63 +0x5e5
github.com/coreos/torus/distributor.(*Distributor).rebalanceTicker(0xc8202f2cf0, 0xc82030d500)
        /home/sgotti/golang/src/github.com/coreos/torus/distributor/rebalance.go:67 +0x490
created by github.com/coreos/torus/distributor.newDistributor
        /home/sgotti/golang/src/github.com/coreos/torus/distributor/distributor.go:69 +0x8fe

goroutine 41 [semacquire, 53 minutes]:
sync.runtime_Semacquire(0xc8202ab194)
        /usr/local/go/src/runtime/sema.go:47 +0x26
sync.(*Mutex).Lock(0xc8202ab190)
        /usr/local/go/src/sync/mutex.go:83 +0x1c4
github.com/coreos/torus/distributor.(*distClient).onPeerTimeout(0xc8202ab180, 0xc8213556e0, 0x24)
        /home/sgotti/golang/src/github.com/coreos/torus/distributor/client.go:40 +0x48
github.com/coreos/torus/distributor.(*distClient).(github.com/coreos/torus/distributor.onPeerTimeout)-fm(0xc8213556e0, 0x24)
        /home/sgotti/golang/src/github.com/coreos/torus/distributor/client.go:35 +0x34
github.com/coreos/torus.(*Server).updatePeerMap(0xc820150200)
        /home/sgotti/golang/src/github.com/coreos/torus/heartbeat.go:124 +0x4e3
github.com/coreos/torus.(*Server).oneHeartbeat(0xc820150200)
        /home/sgotti/golang/src/github.com/coreos/torus/heartbeat.go:100 +0x35b
github.com/coreos/torus.(*Server).heartbeat(0xc820150200, 0xc8202e88a0)
        /home/sgotti/golang/src/github.com/coreos/torus/heartbeat.go:69 +0x30
created by github.com/coreos/torus.(*Server).BeginHeartbeat
        /home/sgotti/golang/src/github.com/coreos/torus/heartbeat.go:62 +0x4a8
```

Looks like:

goroutine A                                                      | goroutine B
-----------------------------------------------------------------|-------------------------------------------------------------------------
calls distClient.getConn                                         |
distClient.getConn takes the distClient.mut lock                 |
calls server.GetPeerMap                                          |
                                                                 | oneHeartbeat takes the server.infoMut lock
                                                                 | calls server.updatePeerMap
                                                                 | calls the timeout callback: distClient.onPeerTimeout
                                                                 | distClient.onPeerTimeout blocks trying to take the distClient.mut lock
server.GetPeerMap blocks trying to take the server.infoMut lock  |
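
In other words, the two goroutines take the same two locks in opposite order. As a minimal illustration of that pattern (hypothetical, stripped-down stand-ins for torus.Server and distClient, not the real code):

```go
package main

import "sync"

// Hypothetical, stripped-down versions of the two structs involved;
// only the mutexes and the calls that matter for the deadlock are kept.
type server struct {
	infoMut   sync.Mutex
	onTimeout func() // plays the role of the timeout callback (distClient.onPeerTimeout)
}

// GetPeerMap needs infoMut, like torus.(*Server).GetPeerMap in the trace above.
func (s *server) GetPeerMap() {
	s.infoMut.Lock()
	defer s.infoMut.Unlock()
}

// oneHeartbeat holds infoMut and, via updatePeerMap, ends up calling the
// timeout callback while still holding it.
func (s *server) oneHeartbeat() {
	s.infoMut.Lock()
	defer s.infoMut.Unlock()
	s.onTimeout()
}

type distClient struct {
	mut sync.Mutex
	srv *server
}

// getConn holds d.mut across the call into the server (the pre-patch shape).
func (d *distClient) getConn() {
	d.mut.Lock()
	defer d.mut.Unlock()
	d.srv.GetPeerMap() // wants infoMut while d.mut is held
}

// onPeerTimeout wants d.mut, but is invoked by the server with infoMut held.
func (d *distClient) onPeerTimeout() {
	d.mut.Lock()
	defer d.mut.Unlock()
}

func main() {
	s := &server{}
	d := &distClient{srv: s}
	s.onTimeout = d.onPeerTimeout

	go d.getConn()      // goroutine A: locks d.mut, then wants s.infoMut
	go s.oneHeartbeat() // goroutine B: locks s.infoMut, then wants d.mut
	select {}           // if the two interleave, both block forever
}
```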

Should fix #238
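
For reference, a rough sketch of what "more fine-grained locking" means here, as a separate self-contained snippet with hypothetical, simplified types (the actual diff may differ): the distClient mutex is only held while touching the client's own state, and is released before any call back into the server.

```go
package main

import "sync"

// Hypothetical, simplified types again, just to show the locking pattern.
type conn struct{ addr string }

type server struct{ infoMut sync.Mutex }

func (s *server) GetPeerMap() {
	s.infoMut.Lock()
	defer s.infoMut.Unlock()
}

type distClient struct {
	mut   sync.Mutex
	srv   *server
	conns map[string]*conn
}

// getConn only holds d.mut while it reads or writes d.conns, and always
// releases it before calling into the server, so GetPeerMap is never
// invoked with d.mut held and the lock order can no longer invert.
func (d *distClient) getConn(addr string) *conn {
	d.mut.Lock()
	c, ok := d.conns[addr]
	d.mut.Unlock() // drop the lock before any server call

	if ok {
		return c
	}

	d.srv.GetPeerMap() // safe: d.mut is not held here

	c = &conn{addr: addr} // stand-in for actually dialing the peer

	d.mut.Lock()
	defer d.mut.Unlock()
	// Another goroutine may have stored a connection while d.mut was
	// released; keep the existing one if so.
	if existing, ok := d.conns[addr]; ok {
		return existing
	}
	d.conns[addr] = c
	return c
}

func main() {
	d := &distClient{srv: &server{}, conns: map[string]*conn{}}
	_ = d.getConn("10.0.0.1:40000")
}
```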

@barakmich
Contributor

👏 Nice catch.

@barakmich barakmich merged commit 35fd1de into coreos:master Jun 13, 2016