minogrpc: deadlock in non tree based routing #170

Open
gnarula opened this issue Jan 4, 2021 · 1 comment
Labels
bug Something isn't working

Comments

gnarula commented Jan 4, 2021

This came up while trying out RABYT. Consider a scenario where we have 15 nodes (0-14).

Node 0 broadcasts a message to the other 14 nodes. Some of them are reached via intermediate nodes, so delivery takes two hops. On receiving the message, each node tries to send a response back to Node 0. Assume the route to Node 0 goes through Node 9. Here's how a deadlock can occur:

Node 0 holds an RLock on session.parentsLock while it broadcasts the message. At the same time, Node 9 tries to establish a connection to Node 0. When the Stream gRPC handler is invoked on Node 0's server side, it tries to acquire a write lock on session.parentsLock in the call to sess.Listen() and blocks. Eventually neither Node 0 nor Node 9 can make progress. Since the other nodes depend on Node 9 to route messages to Node 0, none of them can continue.
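The lock interaction described above can be reproduced in isolation. This is a minimal sketch, not the minogrpc code: the `session` type and the `listenWouldBlock` helper are hypothetical stand-ins, and only `parentsLock` is taken from the issue. It uses `sync.RWMutex.TryLock` (Go 1.18+) so the example reports the conflict instead of hanging.

```go
package main

import (
	"fmt"
	"sync"
)

// session is a hypothetical stand-in for the minogrpc session; only the
// lock named in the issue is modelled.
type session struct {
	parentsLock sync.RWMutex
}

// listenWouldBlock reports whether a Listen-style write lock would block
// right now. It probes with TryLock and releases immediately on success.
func (s *session) listenWouldBlock() bool {
	if s.parentsLock.TryLock() {
		s.parentsLock.Unlock()
		return false
	}
	return true
}

func main() {
	s := &session{}

	// Broadcast path: the read lock is held for the duration of the send.
	s.parentsLock.RLock()
	fmt.Println("during broadcast, Listen would block:", s.listenWouldBlock())
	s.parentsLock.RUnlock()

	// Once the read lock is released the write lock is available again.
	fmt.Println("after broadcast, Listen would block:", s.listenWouldBlock())
}
```

A blocked writer on its own is only a stall. It becomes a deadlock when releasing the read lock depends, directly or through another node, on that writer making progress.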

Possible workaround:

session.RecvPacket and session.Send can avoid holding the RLock by copying the references held in s.parents into a local slice. The risk is then using a relay that has already been closed by the time session.Listen exits. A per-parent lock, as suggested by @cache-nez, might avoid that.
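One way the workaround could look, as a rough sketch under assumptions: the `relay` and `session` types below, the map layout of `parents`, and the `snapshotParents`, `send` and `close` helpers are all hypothetical and do not mirror the real minogrpc types. The read lock is held only while copying the references, and each relay carries its own mutex so that a relay closed after the snapshot is skipped rather than used.

```go
package main

import (
	"fmt"
	"sync"
)

// relay is a hypothetical parent relay guarded by its own lock.
type relay struct {
	mu     sync.Mutex
	closed bool
	sent   []string
}

// send records msg unless the relay has been closed.
func (r *relay) send(msg string) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.closed {
		return false
	}
	r.sent = append(r.sent, msg)
	return true
}

// close marks the relay as unusable, as Listen might do on exit.
func (r *relay) close() {
	r.mu.Lock()
	r.closed = true
	r.mu.Unlock()
}

// session is a hypothetical stand-in for the minogrpc session.
type session struct {
	parentsLock sync.RWMutex
	parents     map[string]*relay
}

// snapshotParents copies the relay references under the read lock and
// releases the lock before returning.
func (s *session) snapshotParents() []*relay {
	s.parentsLock.RLock()
	defer s.parentsLock.RUnlock()

	out := make([]*relay, 0, len(s.parents))
	for _, r := range s.parents {
		out = append(out, r)
	}
	return out
}

// send delivers msg to every open relay without holding parentsLock,
// and returns how many relays accepted it.
func (s *session) send(msg string) int {
	delivered := 0
	for _, r := range s.snapshotParents() {
		if r.send(msg) {
			delivered++
		}
	}
	return delivered
}

func main() {
	a, b := &relay{}, &relay{}
	s := &session{parents: map[string]*relay{"a": a, "b": b}}

	b.close() // simulate a relay closed by a Listen that has exited
	fmt.Println("delivered to", s.send("hello"), "relay(s)")
}
```

The trade-off is that a send may observe a slightly stale set of parents, so the caller has to tolerate a relay refusing the message.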

gnarula commented Jan 4, 2021

Logs for a sample run with 15 nodes (1-15 running on ports 2000-2009, 20010, 20011, 20012, 20013, 20014).

https://drive.google.com/file/d/1a-TPPOXJzTFpmUEAFuIxhuzdeszeWChE/view?usp=sharing

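The following table lists the nodes and their IDs:

| Node # | Port | ID |
| --- | --- | --- |
| 1 | 2000 | JPI |
| 2 | 2001 | NKM |
| 3 | 2002 | EEF |
| 4 | 2003 | IFK |
| 5 | 2004 | CPO |
| 6 | 2005 | MCH |
| 7 | 2006 | HKF |
| 8 | 2007 | OLI |
| 9 | 2008 | PCN |
| 10 | 2009 | FNL |
| 11 | 20010 | DCB |
| 12 | 20011 | AAD |
| 13 | 20012 | HEC |
| 14 | 20013 | JEN |
| 15 | 20014 | MFE |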

Node 1 tries to broadcast a message to all the other nodes. Note that Node 14's ID (JEN) shares the prefix J with Node 1's ID (JPI). The deadlock occurs because:

  1. Node 1 broadcasts a message to the other peers. As a result, Node 1's server side holds an RLock on session.parentsLock (in session.RecvPacket). Consider Node 1 sending a message to Node 13 via Node 7 (common prefix H). On receiving the message, Node 13 tries to send a response back to Node 1 via Node 14 (common prefix J).
  2. On receiving the message from Node 13, Node 14 acquires an RLock on session.parentsLock (in session.RecvPacket) and tries to establish a relay to Node 1. This invokes server::Stream on Node 1, which tries to acquire a write lock on session.parentsLock. This write lock cannot be acquired until the RLock held by Node 1 in session.RecvPacket (1) is released.
  3. Meanwhile, Node 1 tries to send a message to Node 14 and set up a relay to it. This invokes server::Stream on Node 14, which tries to acquire a write lock on session.parentsLock. This write lock cannot be acquired until the RLock held by Node 14 in session.RecvPacket (2) is released. Each node therefore holds its RLock while waiting for the other's write lock, and neither can proceed.

gnarula added a commit that referenced this issue Jan 5, 2021
gnarula added a commit that referenced this issue Jan 5, 2021
@pierluca pierluca added the bug Something isn't working label Mar 8, 2022