
Massive leak of active handles #141

Closed
dapplion opened this issue Feb 24, 2021 · 19 comments
Labels
- effort/weeks: Estimated to take multiple weeks
- exp/wizard: Extensive knowledge (implications, ramifications) required
- kind/bug: A bug in existing code (including security flaws)
- need/analysis: Needs further analysis before proceeding

Comments

@dapplion
Contributor

dapplion commented Feb 24, 2021

Testing Lodestar on a powerful CI machine, we've observed a massive leak of active handles originating at our p2p port (9000). Lodestar uses only the TCP transport, so I'm opening the issue here.

[Screenshot from 2021-02-24 18-27-26]

The peer count, however, is stable at 25, and running wtfnode in that situation returns ~600 - 25 lines like:

- Sockets:
  - 192.168.16.2:9000 -> undefined:undefined
  - 192.168.16.2:9000 -> undefined:undefined
  - 192.168.16.2:9000 -> undefined:undefined
  - 192.168.16.2:9000 -> undefined:undefined
  - 192.168.16.2:9000 -> undefined:undefined
  - 192.168.16.2:9000 -> undefined:undefined
  - 192.168.16.2:9000 -> undefined:undefined

Judging by the wtfnode source and the Node.js net docs, this indicates sockets in an intermediate, not-connected state. My guess is that the TCP transport does not handle some edge cases and sockets are left hanging.

Have you experienced something like this in your application? This is a big issue for Lodestar since we are very resource constrained.
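
For anyone who wants to reproduce this kind of report: wtfnode can be hooked into a running process roughly like this (a minimal sketch; the SIGUSR2 hook is an arbitrary choice, not necessarily how we wire it into Lodestar):

const wtf = require('wtfnode')

// Dump the process's open handles (sockets, servers, timers, ...) on demand,
// without restarting the node.
process.on('SIGUSR2', () => wtf.dump())

Sending the signal (kill -USR2 <pid>) then prints a report like the one above.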

CC @wemeetagain

@wemeetagain
Member

It may be connections/sockets that are closed but still referenced somewhere and not gc'd.

@dapplion
Contributor Author

dapplion commented Feb 24, 2021

I've run some local tests, and once you destroy a socket it no longer appears in the active handles, even if you keep a strong reference to it.

I've extended wtfnode to print more info about the leaked sockets; they are not destroyed and are still in the open state:

  - 192.168.16.2:38732 -> undefined:undefined - destroyed=false pending=true connecting=true state=opening
    - Listeners:
      - connect: onConnect @ /usr/app/node_modules/libp2p-tcp/src/index.js:101
  - 192.168.16.2:9000 -> 3.135.188.5:13000 - destroyed=false pending=false connecting=false state=open
  - 192.168.16.2:9000 -> 18.193.206.240:51472 - destroyed=false pending=false connecting=false state=open
  - 192.168.16.2:54670 -> 69.115.16.114:13000 - destroyed=false pending=false connecting=false state=open
  - 192.168.16.2:9000 -> 34.75.78.237:13000 - destroyed=false pending=false connecting=false state=open
  - 192.168.16.2:9000 -> undefined:undefined - destroyed=false pending=false connecting=false state=open
  - 192.168.16.2:9000 -> undefined:undefined - destroyed=false pending=false connecting=false state=open
  - 192.168.16.2:9000 -> undefined:undefined - destroyed=false pending=false connecting=false state=open
 ....
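
For reference, a standalone version of that extended check looks roughly like this (a sketch built on the undocumented process._getActiveHandles(), which as far as I can tell is what wtfnode inspects under the hood, rather than the actual patch I applied):

const net = require('net')

// Walk the process's active handles and print the same state flags for every
// socket that is still being kept alive.
function dumpSockets () {
  for (const handle of process._getActiveHandles()) {
    if (!(handle instanceof net.Socket)) continue
    console.log(
      `${handle.localAddress}:${handle.localPort} -> ${handle.remoteAddress}:${handle.remotePort}`,
      `destroyed=${handle.destroyed} pending=${handle.pending}`,
      `connecting=${handle.connecting} state=${handle.readyState}`
    )
  }
}

dumpSockets()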

@dapplion
Contributor Author

dapplion commented Mar 1, 2021

When the leak is observed the peer count is stable at 25 peers. However, there may be some rotation of peers, as the counter of total unique peers seen increases.

Note: the charts below are from a different instance than the one above.

[Screenshot from 2021-03-01 18-39-19]

[Screenshot from 2021-03-01 18-37-58]

[Screenshot from 2021-03-01 18-40-54]

[Screenshot from 2021-03-01 18-43-33]

@jacobheun jacobheun added kind/bug A bug in existing code (including security flaws) status/ready Ready to be worked labels Mar 2, 2021
@jacobheun
Contributor

@dapplion what version of Node is this running on?

It's interesting that this is fairly stable and then starts quickly leaking handles. You said this is running on CI? I'm wondering if there is a network event happening that's causing TCP to freak out. Is the timing (time of day) at all predictable?

@dapplion
Contributor Author

dapplion commented Mar 2, 2021

@jacobheun It's running on NodeJS v12.13.1

I haven't seen a correlation with time. Note though that it's a very slow leak, so I don't have a ton of data.

@vasco-santos
Member

I am trying to run a jsipfs daemon with the DHT random walk enabled to force some connections and see if I can replicate this.

Also using wtfnode. Will report back later today

@vasco-santos
Member

vasco-santos commented Mar 3, 2021

I have kept a node running DHT queries over the afternoon using wtfnode and have not been able to fully replicate this so far. I capped the maximum number of connections at 25.

Sometimes I get something like:

  - 192.168.1.120:64734 -> undefined:undefined
    - Listeners:
      - connect: onConnect @ /Users/vsantos/work/pl/gh/libp2p/js-libp2p-daemon/node_modules/libp2p-tcp/src/index.js:101

and on rare occasions I actually get 2-3 occurrences:

  - 192.168.1.120:64640 -> undefined:undefined
    - Listeners:
      - connect: onConnect @ /Users/vsantos/work/pl/gh/libp2p/js-libp2p-daemon/node_modules/libp2p-tcp/src/index.js:101
  - 192.168.1.120:64643 -> undefined:undefined
    - Listeners:
      - connect: onConnect @ /Users/vsantos/work/pl/gh/libp2p/js-libp2p-daemon/node_modules/libp2p-tcp/src/index.js:101

I run wtf.dump() every 10 seconds, and these undefined:undefined hosts are not present in the next listeners report. They are probably in a transitional state (either opening or closing a connection?). This has been running for 7 hours now; I will keep it running for a few more hours before going to sleep to see if that produces a lot of these cases.

I couldn't get into any weird states by playing with libp2p-tcp today with aborts and forced errors, and could not end up with leftover active handles. @jacobheun will do some experiments with this tomorrow, starting with spamming dials/disconnects against a node listening on several interfaces.

@jacobheun
Contributor

I have not been able to replicate the leak yet. I'm currently using a minimally configured libp2p node running Gossipsub, with a low cap (10) on max connections.

The strategies I have been using thus far are:

  1. A consistent peer set that thrashes its own connections to the target node, including pings and topic publishes.
  2. A consistent peer set in excess of the connection manager limit that attempts to keep long-lived connections to the target node, plus ephemeral nodes that shut down after performing a publish.

Memory and handles have been stable so far.
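
Stripped of the libp2p layer, strategy 1 amounts to connection churn against the target's TCP port, roughly like this (a simplified stand-in using raw net sockets; the host, port and timings are arbitrary):

const net = require('net')

const TARGET_HOST = '127.0.0.1' // assumed address of the node under test
const TARGET_PORT = 9000        // its libp2p TCP listen port

// Open a connection and tear it down at a random moment, sometimes before the
// TCP handshake even completes, to exercise the half-open edge cases.
setInterval(() => {
  const socket = net.connect(TARGET_PORT, TARGET_HOST)
  socket.on('error', () => {}) // ignore ECONNRESET etc. from the abrupt teardown
  setTimeout(() => socket.destroy(), Math.floor(Math.random() * 100))
}, 50)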

@dapplion
Contributor Author

dapplion commented Mar 6, 2021

I've kept running various nodes at different versions to pinpoint when the leak starts happening, and the results are very inconclusive. The leak happens arbitrarily, without correlation to network, resources, or version. I'll keep working on it.

@aschmahmann aschmahmann removed the status/ready Ready to be worked label Mar 8, 2021
@BigLep BigLep added the need/analysis Needs further analysis before proceeding label Mar 21, 2021
@dapplion
Contributor Author

dapplion commented Apr 21, 2021

Bump, this is still happening consistently in our nodes. However, it takes a full week to show up significantly.

All the other metrics of the node are stable: memory, CPU, peers, etc.

[Screenshot from 2021-04-21 18-15-20]

@vasco-santos vasco-santos added need/triage Needs initial labeling and prioritization need/analysis Needs further analysis before proceeding and removed need/analysis Needs further analysis before proceeding need/triage Needs initial labeling and prioritization labels Apr 21, 2021
@vasco-santos
Member

@dapplion I am wondering if this is related to multiple closes without awaiting per #140

@dapplion
Contributor Author

@dapplion I am wondering if this is related to multiple closes without awaiting per #140

Maybe? What hypothesis do you have around it?

@vasco-santos
Member

Perhaps a race condition where we never get the close event. For example, if two closes happen and time out, the first will destroy the socket and the second will not, as it is already destroyed. The close event listener will be kept there.

I think we should just have an isClosing promise, which is stored and returned on subsequent closes, instead of the fix in #140. This issue was also reported a few weeks after #140, so perhaps it is a side effect of it. What do you think?
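
Roughly what I have in mind (a sketch only; the names are illustrative and this is not a drop-in patch for libp2p-tcp):

const CLOSE_TIMEOUT = 2000 // ms, arbitrary value for this sketch

// Memoize the in-flight close so that a second close() call returns the same
// promise instead of racing the first one. `socket` is a net.Socket.
function makeCloser (socket) {
  let closing = null

  return function close () {
    if (closing) return closing // subsequent closes just wait on the first one

    closing = new Promise((resolve) => {
      const timeout = setTimeout(() => {
        socket.destroy() // graceful FIN did not complete in time, tear it down
        resolve()
      }, CLOSE_TIMEOUT)

      socket.once('close', () => {
        clearTimeout(timeout)
        resolve()
      })

      socket.end() // start a graceful close and wait for the 'close' event
    })

    return closing
  }
}

That way the 'close' listener is attached at most once per socket and only one code path ever decides whether to destroy it.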

@lidel lidel added effort/weeks Estimated to take multiple weeks exp/wizard Extensive knowledge (implications, ramifications) required labels Apr 26, 2021
@dapplion
Contributor Author

I think we should just have an isClosing promise, which is stored and returned on subsequent closes, instead of the fix in #140. This issue was also reported a few weeks after #140, so perhaps it is a side effect of it. What do you think?

If the socket is already destroyed, how can they keep accumulating? The metrics clearly show that the number of active non-destroyed sockets increases. If destroyed multiple times they may leak handlers and events, but not sockets.

@vasco-santos
Member

The metrics clearly show that the number of active non-destroyed sockets increases

I have seen the metrics for handles, not for leaking sockets. So, you mean there are TCP sockets hanging?

@dapplion
Contributor Author

So, you mean there are TCP sockets hanging?

Yes! That's what I've been referring to from the start. Sorry if I haven't been clear; see the wtfnode reports above.

@dapplion
Contributor Author

Bump, the issue is still active in Lodestar nodes: ChainSafe/lodestar#3526

[screenshot]

@twoeths
Contributor

twoeths commented Dec 24, 2021

Looks like the retained socket is one that was aborted: when a libp2p DialRequest has multiple addresses, it keeps only one and aborts all the remaining dials.

const rawSocket = net.connect(cOpts)

I can see the AbortError here
[Screen Shot 2021-12-24 at 10 36 20]
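
If that's the cause, the fix is presumably to destroy the raw socket as soon as the dial's abort signal fires. A rough sketch of the idea (simplified, not the actual libp2p-tcp dial code; the AbortSignal handling and option names here are assumptions):

const net = require('net')

// Destroy the raw socket when the dial's abort signal fires, so the losing
// candidates of a multi-address DialRequest do not linger as active handles.
function dial (cOpts, { signal } = {}) {
  return new Promise((resolve, reject) => {
    const rawSocket = net.connect(cOpts)

    const onAbort = () => {
      rawSocket.destroy() // tear down the half-open socket immediately
      reject(new Error('dial aborted'))
    }

    if (signal) {
      if (signal.aborted) return onAbort()
      signal.addEventListener('abort', onAbort, { once: true })
    }

    rawSocket.once('connect', () => {
      if (signal) signal.removeEventListener('abort', onAbort)
      resolve(rawSocket)
    })

    rawSocket.once('error', (err) => {
      if (signal) signal.removeEventListener('abort', onAbort)
      rawSocket.destroy()
      reject(err)
    })
  })
}

The key point is just that whichever candidate loses the race gets its raw socket destroyed instead of being left in the opening/open state.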

@achingbrain
Member

This has been verified as fixed; please re-open if it recurs.
