Intermittent connection drops with Edgevpn 0.10.0 / libp2p 0.18.0-rc5 leave peers disconnected #12
Comments
It seems there are issues with the new RC regarding connections. While trying to figure out what's wrong, downgrade to the last good version. See #12
Keep us in the loop, v0.18 is an important release and we want to iron out all issues. Are you using bitswap by any chance? Another pointer: I suspect there might be a bug in yamux that makes it incapable of responding correctly to a refusal to increase the window, but that's still only a theory at this point.
Sure, will do 👍, thanks!
Nope, things here are much simpler: we just send a single block over to the nodes (we don't implement any real PoW, we just use it as a sync mechanism) and there is no block syncing (yet?), so it's tied more closely to the libp2p core modules and a simple pub/sub mechanism, which are just extensions of the libp2p samples.
I'll keep my eyes open, thanks for the hint!
To downgrade libp2p, see mudler/edgevpn#12
Can you also check whether mplex is involved?
I'll give it a shot and try to collect as much info as possible, thanks for the pointers! The fact that nodes can't re-establish a connection afterwards should help trace it. I'll capture logs with the libp2p component at debug log level and try to get them at that exact moment to have a clearer picture of what's going on.
Can you try either disabling mplex or applying libp2p/go-mplex#99?
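As a rough guide, disabling mplex can be done by listing only yamux as a muxer when building the host. A minimal sketch, assuming go-libp2p ~0.18 and the go-libp2p-yamux package (not EdgeVPN's actual setup code):

```go
package main

import (
	"fmt"

	"github.com/libp2p/go-libp2p"
	yamux "github.com/libp2p/go-libp2p-yamux"
)

func main() {
	// Supplying an explicit Muxer option replaces the default set
	// (yamux + mplex), so mplex is never negotiated.
	h, err := libp2p.New(
		libp2p.Muxer("/yamux/1.0.0", yamux.DefaultTransport),
	)
	if err != nil {
		panic(err)
	}
	defer h.Close()
	fmt.Println("yamux-only host:", h.ID())
}
```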
Going to try that, thanks! Although I can only test later in the day as I'm AFK now; I'll let you know as soon as I'm at it and keep you in the loop.
I'm sorry I didn't have time to get back to this during the weekend. I still have to set up my test environment to reproduce the issue, as doing that manually is a time-consuming process (I observed this while setting up Kubernetes clusters on top of it, and that's the most straightforward way for me to reproduce it). I'll look at it during the week and keep you posted.
I'm following the discussions on the PRs; I'll cut a specific version with libp2p/go-libp2p#1350 later and check it out.
I'm trying to set up a small automated test running on GHA to be able to narrow it down. It seems the problem is still there (https://github.com/mudler/edgevpn/runs/5432147596?check_suite_focus=true); in that run I'm trying to send a 2GB file between two nodes. I will enhance it to collect pprof and libp2p debug logs too, so as to have a better view of it. This could also have been something flaky: the setup of the test right now is really simplistic (at the moment it's just bashism, so it's a bit hard to debug; I'll move it to Golang soon so I can make the scenario more interesting).
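For the pprof part, a minimal stdlib-only sketch of exposing profiles from the node under test (the port and structure are arbitrary, not what the EdgeVPN test actually does):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof handlers on the default mux
)

func main() {
	// Expose pprof on a side port while the transfer test runs; profiles
	// can then be grabbed with e.g.:
	//   go tool pprof http://localhost:6060/debug/pprof/goroutine
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
	select {} // stand-in for the actual node/test code
}
```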
You can also get logs with
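Presumably this refers to the go-log loggers that go-libp2p uses. A minimal sketch of turning them up to debug, assuming go-log v2 and the "swarm2" subsystem name used by the swarm:

```go
package main

import (
	logging "github.com/ipfs/go-log/v2"
)

func main() {
	// Debug logs for every go-libp2p subsystem...
	logging.SetAllLoggers(logging.LevelDebug)
	// ...or only for the swarm, which handles connection setup/teardown.
	if err := logging.SetLogLevel("swarm2", "debug"); err != nil {
		panic(err)
	}
	// The same can usually be achieved without code changes via the
	// GOLOG_LOG_LEVEL=debug environment variable (go-log v2).
}
```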
I just did a test in my homelab with multiple VMs and everything seems good here! I'll cut a release and test it a bit more in a bigger scenario. So far the connections between the nodes seem stable again, with no drops at all! I'll keep you posted if I notice something strange.
I've cut v0.11.0 with libp2p 0.18.0-rc6, thanks! I'll keep you in the loop if I spot something.
Great, thank you!
Alright, it seems that while testing on a bigger scale I'm seeing the same issues: intermittently, nodes drop off and don't connect back again. I've also cut a release of c3os with it, where the issue can be observed too: https://github.com/c3os-io/c3os/releases/tag/v1.21.4-36
There seems to be a slight difference: it seems to happen when I start to send over big chunks of data. It survives pings and other light traffic just fine.
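A hypothetical reproduction sketch for this pattern: push 2 GiB of zeroes through a single libp2p stream between two in-process hosts and see whether the connection gets killed. The protocol ID and structure are made up, and it assumes go-libp2p ~0.18 import paths; it is not EdgeVPN's actual test:

```go
package main

import (
	"context"
	"fmt"
	"io"
	"time"

	"github.com/libp2p/go-libp2p"
	"github.com/libp2p/go-libp2p-core/network"
	"github.com/libp2p/go-libp2p-core/peer"
)

const proto = "/droptest/1.0.0" // hypothetical protocol ID

// zeroReader yields an endless stream of zero bytes to send.
type zeroReader struct{}

func (zeroReader) Read(p []byte) (int, error) { return len(p), nil }

func main() {
	ctx := context.Background()

	sender, err := libp2p.New()
	if err != nil {
		panic(err)
	}
	receiver, err := libp2p.New()
	if err != nil {
		panic(err)
	}

	// The receiver just counts incoming bytes.
	receiver.SetStreamHandler(proto, func(s network.Stream) {
		n, err := io.Copy(io.Discard, s)
		fmt.Println("received", n, "bytes, err:", err)
		s.Close()
	})

	// Connect directly, then push 2 GiB through one stream.
	_ = sender.Connect(ctx, peer.AddrInfo{ID: receiver.ID(), Addrs: receiver.Addrs()})
	s, err := sender.NewStream(ctx, receiver.ID(), proto)
	if err != nil {
		panic(err)
	}
	n, err := io.CopyN(s, zeroReader{}, 2<<30)
	fmt.Println("sent", n, "bytes, err:", err)
	s.Close()

	time.Sleep(time.Second) // give the receiver handler a moment to log
}
```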
Also reverts rcmgr configuration. See #12
OK, disabling the rcmgr makes everything work as usual, so it probably has to do with the default limits. I was using the rcmgr defaults in my first attempts, so maybe those were indeed too conservative. I'm going to disable rcmgr by default and play around with it until I get some good defaults by running benchmarks, and maybe reuse the same maxConns approach as in lotus to see if that suits my case as well (it might not fit very well on Pis, but we shall see :) ).
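The maxConns idea mentioned here amounts to deriving all resource limits from a single knob. A purely illustrative sketch; the helper, field names and scaling factors are hypothetical, not lotus's or EdgeVPN's actual code:

```go
package main

import "fmt"

// connLimits is a hypothetical bundle of limits derived from a single
// maxConns knob, loosely mirroring the lotus-style approach mentioned
// above. The scaling factors are made up for illustration.
type connLimits struct {
	ConnsInbound    int
	ConnsOutbound   int
	StreamsInbound  int
	StreamsOutbound int
}

func limitsFromMaxConns(maxConns int) connLimits {
	return connLimits{
		ConnsInbound:    maxConns,
		ConnsOutbound:   maxConns,
		StreamsInbound:  maxConns * 16, // several streams per connection
		StreamsOutbound: maxConns * 16,
	}
}

func main() {
	// A small default that should still fit on a Raspberry Pi.
	fmt.Printf("%+v\n", limitsFromMaxConns(100))
}
```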
Yeah, the default inbound connection limit is very conservative.
After about 30 minutes of usage, I started to notice constant connection drops on peer nodes. The issue seems to persist, as connections don't seem to be rebuilt between nodes automatically, leaving peers disconnected. The only workaround is restarting the service.
This seems to be tied to the recent libp2p bump to 0.18.0-rc5. I'm not sure whether it's due to the resource manager (rcmgr) configuration or something else. I still can't trace it, but this is what I'm seeing at a behavioral level: while opening a bunch of streams on a single connection, the connection eventually gets killed and the node seemingly can't recover and connect to it again.
This also seems to be an issue even with small streams: where I was previously pushing GBs of traffic between nodes just fine, now even simple HTTP requests don't hold up.
@vyzo / @marten-seemann sorry to ping you directly again, I don't want to sound annoying either. I'm seeing weird issues with 0.18.0-rc5 here. I'm not sure whether it's due to the rcmgr configuration or something else. I still can't trace it and provide helpful debug information, but this is what I'm seeing at a behavioral level; the effect is quite noticeable.