Intermittent connection drops with Edgevpn 0.10.0 / libp2p 0.18.0-rc5 leave peers disconnected #12
Comments
It seems there are issues with the new RC regarding connections. While trying to figure out what's wrong, downgrade to the last good version. See #12
Keep us in the loop, v0.18 is an important release and we want to iron out all issues. Are you using bitswap by any chance? Another pointer: I suspect there might be a bug in yamux that makes it incapable of responding correctly to a refusal to increase the window, but that's still only a theory at this point.
Sure, will do 👍, thanks!
Nope, things here are much simpler: we just send a single block over to the nodes (we don't implement any real PoW, we just use it as a sync mechanism) and there is no block syncing (yet?), so it's tied more closely to the libp2p core modules and a simple pub/sub mechanism, which are just extensions of the libp2p samples.
I'll keep my eyes open, thanks for the hint!
To downgrade libp2p, see mudler/edgevpn#12
Can you also check whether mplex is involved?
I'll give it a shot and try to collect as much info as possible, thanks for the pointers! The fact that nodes can't re-establish a connection afterwards should help trace it. I'll capture logs with the libp2p component at debug log level and try to get them at that exact moment to have a clearer picture of what's going on.
Can you try either disabling mplex or applying libp2p/go-mplex#99?
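As a rough guide, disabling mplex can be done by listing only yamux as a muxer when building the host. A minimal sketch, assuming go-libp2p ~0.18 and the go-libp2p-yamux package (not EdgeVPN's actual setup code):

```go
package main

import (
	"fmt"

	"github.com/libp2p/go-libp2p"
	yamux "github.com/libp2p/go-libp2p-yamux"
)

func main() {
	// Supplying an explicit Muxer option replaces the default set
	// (yamux + mplex), so mplex is never negotiated.
	h, err := libp2p.New(
		libp2p.Muxer("/yamux/1.0.0", yamux.DefaultTransport),
	)
	if err != nil {
		panic(err)
	}
	defer h.Close()
	fmt.Println("yamux-only host:", h.ID())
}
```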
Going to try that, thanks! Although I can only test later in the day as I'm AFK now; I'll let you know as soon as I'm at it and keep you in the loop.
I'm sorry I didn't have time to get back to this during the weekend. I still have to set up my test environment to reproduce the issue, as doing that manually is a time-consuming process (I observed this while setting up Kubernetes clusters on top of it, and that's the most straightforward way for me to reproduce it). I'll look at it during the week and keep you posted.
I'm following the discussions on the PRs; I'll cut a specific version with libp2p/go-libp2p#1350 later and check it out.
I'm trying to set up a small automated test running on GHA to be able to narrow it down. It seems the problem is still there (https://github.com/mudler/edgevpn/runs/5432147596?check_suite_focus=true); in that run I'm trying to send a 2GB file between two nodes. I will enhance it to collect pprof and libp2p debug logs too, so as to have a better view of it. This could also have been something flaky: the setup of the test right now is really simplistic (at the moment it's just bashism, so it's a bit hard to debug; I'll move it to Golang soon so I can make the scenario more interesting).
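For the pprof part, a minimal stdlib-only sketch of exposing profiles from the node under test (the port and structure are arbitrary, not what the EdgeVPN test actually does):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof handlers on the default mux
)

func main() {
	// Expose pprof on a side port while the transfer test runs; profiles
	// can then be grabbed with e.g.:
	//   go tool pprof http://localhost:6060/debug/pprof/goroutine
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
	select {} // stand-in for the actual node/test code
}
```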
You can also get logs with
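Presumably this refers to the go-log loggers that go-libp2p uses. A minimal sketch of turning them up to debug, assuming go-log v2 and the "swarm2" subsystem name used by the swarm:

```go
package main

import (
	logging "github.com/ipfs/go-log/v2"
)

func main() {
	// Debug logs for every go-libp2p subsystem...
	logging.SetAllLoggers(logging.LevelDebug)
	// ...or only for the swarm, which handles connection setup/teardown.
	if err := logging.SetLogLevel("swarm2", "debug"); err != nil {
		panic(err)
	}
	// The same can usually be achieved without code changes via the
	// GOLOG_LOG_LEVEL=debug environment variable (go-log v2).
}
```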
I just did a test in my homelab with multiple VMs and everything seems good here! I'll cut a release and test it a bit more in a bigger scenario. So far the connections between the nodes seem stable again, with no drops at all! I'll keep you posted if I notice something strange.
I've cut v0.11.0 with libp2p 0.18.0-rc6, thanks! I'll keep you in the loop if I spot something.
Great, thank you!
Alright, it seems that while testing on a bigger scale I'm seeing the same issues: intermittently, nodes drop off and don't connect back again. I've also cut a release of c3os with it, where the issue can be observed too: https://github.com/c3os-io/c3os/releases/tag/v1.21.4-36
There seems to be a slight difference: it seems to happen when I start to send over big chunks of data. It survives pings and other light traffic just fine.
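A hypothetical reproduction sketch for this pattern: push 2 GiB of zeroes through a single libp2p stream between two in-process hosts and see whether the connection gets killed. The protocol ID and structure are made up, and it assumes go-libp2p ~0.18 import paths; it is not EdgeVPN's actual test:

```go
package main

import (
	"context"
	"fmt"
	"io"
	"time"

	"github.com/libp2p/go-libp2p"
	"github.com/libp2p/go-libp2p-core/network"
	"github.com/libp2p/go-libp2p-core/peer"
)

const proto = "/droptest/1.0.0" // hypothetical protocol ID

// zeroReader yields an endless stream of zero bytes to send.
type zeroReader struct{}

func (zeroReader) Read(p []byte) (int, error) { return len(p), nil }

func main() {
	ctx := context.Background()

	sender, err := libp2p.New()
	if err != nil {
		panic(err)
	}
	receiver, err := libp2p.New()
	if err != nil {
		panic(err)
	}

	// The receiver just counts incoming bytes.
	receiver.SetStreamHandler(proto, func(s network.Stream) {
		n, err := io.Copy(io.Discard, s)
		fmt.Println("received", n, "bytes, err:", err)
		s.Close()
	})

	// Connect directly, then push 2 GiB through one stream.
	_ = sender.Connect(ctx, peer.AddrInfo{ID: receiver.ID(), Addrs: receiver.Addrs()})
	s, err := sender.NewStream(ctx, receiver.ID(), proto)
	if err != nil {
		panic(err)
	}
	n, err := io.CopyN(s, zeroReader{}, 2<<30)
	fmt.Println("sent", n, "bytes, err:", err)
	s.Close()

	time.Sleep(time.Second) // give the receiver handler a moment to log
}
```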
Also reverts rcmgr configuration. See #12
OK, disabling the rcmgr makes everything work as usual, so it probably has to do with the default limits. I was using the rcmgr defaults in my first attempts, so maybe those were indeed too conservative. I'm going to disable rcmgr by default and play around with it until I get some good defaults by running benchmarks, and maybe reuse the same maxConns approach as in lotus to see if that suits my case as well (it might not fit very well on Pis, but we shall see :) ).
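The maxConns idea mentioned here amounts to deriving all resource limits from a single knob. A purely illustrative sketch; the helper, field names and scaling factors are hypothetical, not lotus's or EdgeVPN's actual code:

```go
package main

import "fmt"

// connLimits is a hypothetical bundle of limits derived from a single
// maxConns knob, loosely mirroring the lotus-style approach mentioned
// above. The scaling factors are made up for illustration.
type connLimits struct {
	ConnsInbound    int
	ConnsOutbound   int
	StreamsInbound  int
	StreamsOutbound int
}

func limitsFromMaxConns(maxConns int) connLimits {
	return connLimits{
		ConnsInbound:    maxConns,
		ConnsOutbound:   maxConns,
		StreamsInbound:  maxConns * 16, // several streams per connection
		StreamsOutbound: maxConns * 16,
	}
}

func main() {
	// A small default that should still fit on a Raspberry Pi.
	fmt.Printf("%+v\n", limitsFromMaxConns(100))
}
```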
Yeah, the default inbound connection limit is very conservative.
After about 30 minutes of usage, I started to notice constant connection drops on peer nodes. The issue seems to persist, as connections don't seem to be rebuilt between nodes automatically, leaving peers disconnected. The only workaround is restarting the service.
This seems to be tied to the recent libp2p bump to 0.18.0-rc5. I'm not sure whether it's due to the resource manager (rcmgr) configuration or something else. I still can't trace it, but this is what I'm seeing at a behavioral level: while opening a bunch of streams on a single connection, the connection eventually gets killed and the node seemingly can't recover and connect to it again.
This also seems to be an issue even with small streams: where I was previously pushing GBs of traffic between nodes just fine, now even simple HTTP requests don't hold up.
@vyzo / @marten-seemann sorry to ping you directly again, I don't want to sound annoying either. I'm seeing weird issues with 0.18.0-rc5 here. I'm not sure whether it's due to the rcmgr configuration or something else. I still can't trace it and provide helpful debug information, but this is what I'm seeing at a behavioral level; the effect is quite noticeable.