
Massive leak of active handles #141

Closed
dapplion opened this issue Feb 24, 2021 · 19 comments
Labels
- effort/weeks: Estimated to take multiple weeks
- exp/wizard: Extensive knowledge (implications, ramifications) required
- kind/bug: A bug in existing code (including security flaws)
- need/analysis: Needs further analysis before proceeding

Comments

@dapplion
Contributor

dapplion commented Feb 24, 2021

Testing Lodestar on a powerful CI machine, we've observed a massive leak of active handles originating at our p2p port (9000). Lodestar uses only the TCP transport, so I'm opening the issue here.

[Screenshot from 2021-02-24 18-27-26]

The peer count, however, is stable at 25, and running wtfnode in that situation returns ~600 - 25 lines like:

- Sockets:
  - 192.168.16.2:9000 -> undefined:undefined
  - 192.168.16.2:9000 -> undefined:undefined
  - 192.168.16.2:9000 -> undefined:undefined
  - 192.168.16.2:9000 -> undefined:undefined
  - 192.168.16.2:9000 -> undefined:undefined
  - 192.168.16.2:9000 -> undefined:undefined
  - 192.168.16.2:9000 -> undefined:undefined

Judging by the wtfnode source and the Node.js net docs, this indicates sockets in an intermediate, not-connected state. My guess is that the TCP transport does not handle some edge cases and sockets are left hanging.

Have you experienced something like this in your application? This is a big issue for Lodestar since we are very resource constrained.
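
For anyone who wants to reproduce this kind of report: wtfnode can be hooked into a running process roughly like this (a minimal sketch; the SIGUSR2 hook is an arbitrary choice, not necessarily how we wire it into Lodestar):

const wtf = require('wtfnode')

// Dump the process's open handles (sockets, servers, timers, ...) on demand,
// without restarting the node.
process.on('SIGUSR2', () => wtf.dump())

Sending the signal (kill -USR2 <pid>) then prints a report like the one above.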

CC @wemeetagain

@wemeetagain
Member

It may be connections/sockets that are closed but still referenced somewhere and not gc'd.

@dapplion
Contributor Author

dapplion commented Feb 24, 2021

I've run some local tests, and once you destroy a socket it no longer appears in the active handles, even if you keep a strong reference to it.

I've extended wtfnode to print more info about the leaked sockets; they are not destroyed and are still in the open state:

  - 192.168.16.2:38732 -> undefined:undefined - destroyed=false pending=true connecting=true state=opening
    - Listeners:
      - connect: onConnect @ /usr/app/node_modules/libp2p-tcp/src/index.js:101
  - 192.168.16.2:9000 -> 3.135.188.5:13000 - destroyed=false pending=false connecting=false state=open
  - 192.168.16.2:9000 -> 18.193.206.240:51472 - destroyed=false pending=false connecting=false state=open
  - 192.168.16.2:54670 -> 69.115.16.114:13000 - destroyed=false pending=false connecting=false state=open
  - 192.168.16.2:9000 -> 34.75.78.237:13000 - destroyed=false pending=false connecting=false state=open
  - 192.168.16.2:9000 -> undefined:undefined - destroyed=false pending=false connecting=false state=open
  - 192.168.16.2:9000 -> undefined:undefined - destroyed=false pending=false connecting=false state=open
  - 192.168.16.2:9000 -> undefined:undefined - destroyed=false pending=false connecting=false state=open
 ....
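
For reference, a standalone version of that extended check looks roughly like this (a sketch built on the undocumented process._getActiveHandles(), which as far as I can tell is what wtfnode inspects under the hood, rather than the actual patch I applied):

const net = require('net')

// Walk the process's active handles and print the same state flags for every
// socket that is still being kept alive.
function dumpSockets () {
  for (const handle of process._getActiveHandles()) {
    if (!(handle instanceof net.Socket)) continue
    console.log(
      `${handle.localAddress}:${handle.localPort} -> ${handle.remoteAddress}:${handle.remotePort}`,
      `destroyed=${handle.destroyed} pending=${handle.pending}`,
      `connecting=${handle.connecting} state=${handle.readyState}`
    )
  }
}

dumpSockets()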

@dapplion
Contributor Author

dapplion commented Mar 1, 2021

When the leak is observed the peer count is stable at 25 peers. However, there may be some rotation of peers, as the counter of total unique peers seen increases.

Note: the charts below are from a different instance than the one above.

[Screenshot from 2021-03-01 18-39-19]

[Screenshot from 2021-03-01 18-37-58]

[Screenshot from 2021-03-01 18-40-54]

[Screenshot from 2021-03-01 18-43-33]

@jacobheun jacobheun added kind/bug A bug in existing code (including security flaws) status/ready Ready to be worked labels Mar 2, 2021
@jacobheun
Contributor

@dapplion what version of Node is this running on?

It's interesting that this is fairly stable and then starts quickly leaking handles. You said this is running on CI? I'm wondering if there is a network event happening that's causing TCP to freak out. Is the timing (time of day) at all predictable?

@dapplion
Contributor Author

dapplion commented Mar 2, 2021

@jacobheun It's running on NodeJS v12.13.1

I haven't seen a correlation with time. Note though that it's a very slow leak, so I don't have a ton of data.

@vasco-santos
Member

I am trying to run a jsipfs daemon with the DHT random walk enabled to force some connections and see if I can replicate this.

Also using wtfnode. Will report back later today

@vasco-santos
Member

vasco-santos commented Mar 3, 2021

I have kept a node running DHT queries over the afternoon using wtfnode and have not been able to fully replicate this so far. I capped the maximum number of connections at 25.

Sometimes I get something like:

  - 192.168.1.120:64734 -> undefined:undefined
    - Listeners:
      - connect: onConnect @ /Users/vsantos/work/pl/gh/libp2p/js-libp2p-daemon/node_modules/libp2p-tcp/src/index.js:101

and on rare occasions I actually get 2-3 occurrences:

  - 192.168.1.120:64640 -> undefined:undefined
    - Listeners:
      - connect: onConnect @ /Users/vsantos/work/pl/gh/libp2p/js-libp2p-daemon/node_modules/libp2p-tcp/src/index.js:101
  - 192.168.1.120:64643 -> undefined:undefined
    - Listeners:
      - connect: onConnect @ /Users/vsantos/work/pl/gh/libp2p/js-libp2p-daemon/node_modules/libp2p-tcp/src/index.js:101

I run wtf.dump() every 10 seconds, and these undefined:undefined hosts are not present in the next listeners report. They are probably in a transitional state (either opening or closing a connection?). This has been running for 7 hours now; I will keep it running for a few more hours before going to sleep to see if that produces a lot of these cases.

I couldn't get into any weird states by playing with libp2p-tcp today with aborts and forced errors, and could not end up with leftover active handles. @jacobheun will do some experiments with this tomorrow, starting with spamming dials/disconnects against a node listening on several interfaces.

@jacobheun
Contributor

I have not been able to replicate the leak yet. I'm currently using a minimally configured libp2p node running Gossipsub, with a low cap (10) on max connections.

The strategies I have been using thus far are:

  1. A consistent peer set that thrashes its own connections to the target node, including pings and topic publishes.
  2. A consistent peer set in excess of the connection manager limit that attempts to keep long-lived connections to the target node, plus ephemeral nodes that shut down after performing a publish.

Memory and handles have been stable so far.
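
Stripped of the libp2p layer, strategy 1 amounts to connection churn against the target's TCP port, roughly like this (a simplified stand-in using raw net sockets; the host, port and timings are arbitrary):

const net = require('net')

const TARGET_HOST = '127.0.0.1' // assumed address of the node under test
const TARGET_PORT = 9000        // its libp2p TCP listen port

// Open a connection and tear it down at a random moment, sometimes before the
// TCP handshake even completes, to exercise the half-open edge cases.
setInterval(() => {
  const socket = net.connect(TARGET_PORT, TARGET_HOST)
  socket.on('error', () => {}) // ignore ECONNRESET etc. from the abrupt teardown
  setTimeout(() => socket.destroy(), Math.floor(Math.random() * 100))
}, 50)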

@dapplion
Contributor Author

dapplion commented Mar 6, 2021

I've kept running various nodes at different versions to pinpoint when the leak starts happening, and the results are very inconclusive. The leak happens arbitrarily, without correlation to network, resources, or version. I'll keep working on it.

@aschmahmann aschmahmann removed the status/ready Ready to be worked label Mar 8, 2021
@BigLep BigLep added the need/analysis Needs further analysis before proceeding label Mar 21, 2021
@dapplion
Contributor Author

dapplion commented Apr 21, 2021

Bump, this is still happening consistently in our nodes. However, it takes a full week to show up significantly.

All the other metrics of the node are stable: memory, CPU, peers, etc.

[Screenshot from 2021-04-21 18-15-20]

@vasco-santos vasco-santos added need/triage Needs initial labeling and prioritization need/analysis Needs further analysis before proceeding and removed need/analysis Needs further analysis before proceeding need/triage Needs initial labeling and prioritization labels Apr 21, 2021
@vasco-santos
Member

@dapplion I am wondering if this is related to multiple closes without awaiting per #140

@dapplion
Contributor Author

@dapplion I am wondering if this is related to multiple closes without awaiting per #140

Maybe? What hypothesis do you have around it?

@vasco-santos
Member

Perhaps a race condition where we never get the close event. For example, if two closes happen and time out, the first will destroy the socket and the second will not, as it is already destroyed. The close event listener will be kept there.

I think we should just have an isClosing promise, which is stored and returned on subsequent closes, instead of the fix in #140. This issue was also reported a few weeks after #140, so perhaps it is a side effect of it. What do you think?
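
Roughly what I have in mind (a sketch only; the names are illustrative and this is not a drop-in patch for libp2p-tcp):

const CLOSE_TIMEOUT = 2000 // ms, arbitrary value for this sketch

// Memoize the in-flight close so that a second close() call returns the same
// promise instead of racing the first one. `socket` is a net.Socket.
function makeCloser (socket) {
  let closing = null

  return function close () {
    if (closing) return closing // subsequent closes just wait on the first one

    closing = new Promise((resolve) => {
      const timeout = setTimeout(() => {
        socket.destroy() // graceful FIN did not complete in time, tear it down
        resolve()
      }, CLOSE_TIMEOUT)

      socket.once('close', () => {
        clearTimeout(timeout)
        resolve()
      })

      socket.end() // start a graceful close and wait for the 'close' event
    })

    return closing
  }
}

That way the 'close' listener is attached at most once per socket and only one code path ever decides whether to destroy it.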

@lidel lidel added effort/weeks Estimated to take multiple weeks exp/wizard Extensive knowledge (implications, ramifications) required labels Apr 26, 2021
@dapplion
Contributor Author

I think we should just have an isClosing promise, which is stored and returned on subsequent closes, instead of the fix in #140. This issue was also reported a few weeks after #140, so perhaps it is a side effect of it. What do you think?

If the socket is already destroyed, how can they keep accumulating? The metrics clearly show that the number of active non-destroyed sockets increases. If destroyed multiple times they may leak handlers and events, but not sockets.

@vasco-santos
Member

The metrics clearly show that the number of active non-destroyed sockets increases

I have seen the metrics for handles, not for leaking sockets. So, you mean there are TCP sockets hanging?

@dapplion
Contributor Author

So, you mean there are TCP sockets hanging?

Yes! That's what I've been referring to from the start. Sorry if I haven't been clear; see the wtfnode reports above.

@dapplion
Contributor Author

Bump, the issue is still active in Lodestar nodes: ChainSafe/lodestar#3526

[screenshot]

@twoeths
Contributor

twoeths commented Dec 24, 2021

Looks like the retained socket is one that was aborted: when a libp2p DialRequest has multiple addresses, it keeps only one and aborts all the remaining dials.

const rawSocket = net.connect(cOpts)

I can see the AbortError here
[Screen Shot 2021-12-24 at 10 36 20]
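
If that's the cause, the fix is presumably to destroy the raw socket as soon as the dial's abort signal fires. A rough sketch of the idea (simplified, not the actual libp2p-tcp dial code; the AbortSignal handling and option names here are assumptions):

const net = require('net')

// Destroy the raw socket when the dial's abort signal fires, so the losing
// candidates of a multi-address DialRequest do not linger as active handles.
function dial (cOpts, { signal } = {}) {
  return new Promise((resolve, reject) => {
    const rawSocket = net.connect(cOpts)

    const onAbort = () => {
      rawSocket.destroy() // tear down the half-open socket immediately
      reject(new Error('dial aborted'))
    }

    if (signal) {
      if (signal.aborted) return onAbort()
      signal.addEventListener('abort', onAbort, { once: true })
    }

    rawSocket.once('connect', () => {
      if (signal) signal.removeEventListener('abort', onAbort)
      resolve(rawSocket)
    })

    rawSocket.once('error', (err) => {
      if (signal) signal.removeEventListener('abort', onAbort)
      rawSocket.destroy()
      reject(err)
    })
  })
}

The key point is just that whichever candidate loses the race gets its raw socket destroyed instead of being left in the opening/open state.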

@achingbrain
Member

This has been verified as fixed; please re-open if it recurs.
