Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve collator networking #1526

Closed
wants to merge 2 commits into from
Closed

Conversation

rphmeier
Copy link
Contributor

@rphmeier rphmeier commented Sep 12, 2023

Fixes #1525

This alters the behavior of collators. We now only disconnect from reserved peers between leaves (after new candidates are likely to have been submitted). This avoids some race conditions such as #1499 , where disconnections were being immediately followed by reconnections.

@rphmeier rphmeier added T0-node This PR/Issue is related to the topic “node”. T8-parachains_engineering labels Sep 12, 2023
@rphmeier rphmeier added A0-needs_burnin Pull request needs to be tested on a live validator node before merge. DevOps is notified via matrix and removed A0-needs_burnin Pull request needs to be tested on a live validator node before merge. DevOps is notified via matrix labels Sep 12, 2023
@rphmeier rphmeier force-pushed the rh-improve-collator-networking branch from 30ea4f7 to 9a53e98 Compare September 13, 2023 04:34
@rphmeier
Copy link
Contributor Author

Somehow this made things substantially worse. Not sure why. it seems to be the worst of both worlds

@rphmeier
Copy link
Contributor Author

Actually, it seems much better now. Perhaps we were seeing another issue related to #1517 which is apparently causing validators to crash. Will hold out for better results. One change I can imagine is tweaking the timer to be 2.5s, which should work even better than 4s as it gives 3.5s for connections to be fully closed before reopening.

@dmitry-markin
Copy link
Contributor

I'm not sure if adding more time will help connections to "fully close". If the node is not sending anything over the substream during that time, it won't be able to detect that the substream was closed and accept a new one.

@rphmeier
Copy link
Contributor Author

rphmeier commented Oct 5, 2023

I'm not sure if adding more time will help connections to "fully close". If the node is not sending anything over the substream during that time, it won't be able to detect that the substream was closed and accept a new one.

It seemed to help in practice, but can you elaborate? How do you force a substream to close if disconnect doesn't work? Immediately send an empty notification over it?

@dmitry-markin
Copy link
Contributor

I'm not sure if adding more time will help connections to "fully close". If the node is not sending anything over the substream during that time, it won't be able to detect that the substream was closed and accept a new one.

It seemed to help in practice, but can you elaborate? How do you force a substream to close if disconnect doesn't work? Immediately send an empty notification over it?

disconnect does work, but the other side only learns that the stream is closed after sending a notification into it. That's on the application level, on libp2p protocol level the node is instantly aware of the closed substream. Sending an empty notification should work. We also have a fix for the remote not learning instantly that the substream is closed: paritytech/substrate#13396 (reverted because it made insufficient inbound slots issue manifest, but that should be fixed now).

@altonen could you remind why you are against merging this fix now?

@altonen
Copy link
Contributor

altonen commented Oct 9, 2023

@dmitry-markin

Which PR fixed the insufficient inbound slots issue? We adjusted the ratios a while back but I don't count that as a fix and AFAIR there hasn't been any direct PR addressing the issue because it's not something we can fix in sc-network.

I'm not against merging the PR but it's more of a hack than an actual fix. IMO proper way to fix this would be to detect when the inbound substream is closed and then consider the connection closed.

@dmitry-markin
Copy link
Contributor

dmitry-markin commented Oct 9, 2023

Which PR fixed the insufficient inbound slots issue? We adjusted the ratios a while back but I don't count that as a fix and AFAIR there hasn't been any direct PR addressing the issue because it's not something we can fix in sc-network.

I merely meant adjusting the ratios when a wrote about the fix.

I'm not against merging the PR but it's more of a hack than an actual fix. IMO proper way to fix this would be to detect when the inbound substream is closed and then consider the connection closed.

Thanks for the comment, I forgot the discussion about not allowing half-open notifications substreams, and that was essentially what I was asking about.

@bkchr
Copy link
Member

bkchr commented Jul 17, 2024

@dmitry-markin please close if not required.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
T0-node This PR/Issue is related to the topic “node”.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve validators_buffer behavior in collator protocol
6 participants