High rate of dial failures / peers not connected for request/response on Versi #589

Open
rphmeier opened this issue Aug 1, 2023 · 8 comments
Assignees
Labels
I2-bug The node fails to follow expected behavior.

Comments

@rphmeier
Contributor

rphmeier commented Aug 1, 2023

We are seeing high rates of request failures and discovery failures on recent master nodes. These appear in the logs as warnings. This may be related to the libp2p upgrade in 01fd49a7fafa01f133e2dec538a2ef7c697a26aa or the logic changes to the discovery system in 1346281e1a12958bb08d5fcf55c7563750719388

Rates of validator connectivity are also lower than expected.

This is possibly only an issue between master nodes and paritytech/substrate#5022 nodes, though that is yet to be confirmed.

If this is a paritytech/substrate#5022 issue, then the most likely culprit is some misconfiguration in the peer-set: https://github.com/paritytech/polkadot/pull/6782/files#diff-01ac05045f5aef4678a1579846a54002dc6fd86cd6747d4232f0245e04d7ae5d

@rphmeier
Contributor Author

rphmeier commented Aug 1, 2023

@vstakhov soft-assigned you, any thoughts?

@altonen
Contributor

altonen commented Aug 1, 2023

Dial failures are possibly related to this: #498

@rphmeier
Contributor Author

rphmeier commented Aug 1, 2023

After testing master, there are some meaningful differences between peer counts, but the behavior is strange.

In scenario 1: 100 async backing nodes (ab), 87 master nodes (cde):

  • average peer count for (ab) nodes was 132-133
  • average peer count for (cde) nodes was 105

In scenario 2: 187 master nodes (abcde):

  • average peer count for all nodes was 132-133

No peer is connected to all the other validators. The inability to connect to roughly 30% of other nodes seems quite stable. While async backing nodes were running, master-based nodes seemed to have a hard time connecting to them, but not vice-versa. Not sure how that works.

I'd guess we're dealing with two simultaneous issues:

  1. General connectivity issue
  2. Network protocol upgrade connectivity issue
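
As a rough check on the "30%" figure above, the reported averages can be turned into a fraction of unreachable nodes. This is only a sketch: it assumes the peer counts include only the 187 nodes in the test (so each node has at most 186 possible peers), and it uses 132.5 as the midpoint of the reported 132-133 range. The function name `connected_fraction` is made up for illustration.

```python
def connected_fraction(avg_peers: float, total_nodes: int) -> float:
    """Fraction of the *other* nodes that an average node is connected to."""
    if total_nodes < 2:
        raise ValueError("need at least two nodes")
    # A node cannot be its own peer, so the maximum is total_nodes - 1.
    return avg_peers / (total_nodes - 1)


# Assumption: both scenarios have 187 nodes in total
# (100 async backing + 87 master, or 187 master).
TOTAL_NODES = 187

# Assumption: 132.5 stands in for the reported "132-133" average.
observations = [
    ("scenario 1, (ab) nodes", 132.5),
    ("scenario 1, (cde) nodes", 105.0),
    ("scenario 2, all nodes", 132.5),
]

for label, avg_peers in observations:
    missing = 1.0 - connected_fraction(avg_peers, TOTAL_NODES)
    print(f"{label}: about {missing:.0%} of other nodes not connected")
```

Under these assumptions the (ab) nodes and the all-master network miss a bit under 30% of possible peers, while the (cde) nodes in scenario 1 miss noticeably more, which is consistent with two separate issues being in play.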

@vstakhov
Contributor

vstakhov commented Aug 1, 2023

Can it be related to an outdated runtime? I see quite a lot of errors like `runtime: length of a bounded vector in scope Warning: The network has more peers than expected A runtime configuration adjustment may be needed. is not respected.` But this particular frame has been reworked here: paritytech/substrate#14251
Versi now uses Westend runtime, just in case.

@rphmeier
Contributor Author

rphmeier commented Aug 1, 2023

That could be. Let's see how it runs with an updated runtime (using Westend runtime is a bit odd, but as long as the network doesn't go down, no problem). At least worth ruling out any confounding factors.

@vstakhov
Contributor

vstakhov commented Aug 2, 2023

It seems that the problems with the network got resolved after the libp2p downgrade.

@vstakhov
Contributor

vstakhov commented Aug 3, 2023

It seems that this PR has resolved the issue with the upgraded libp2p: paritytech/substrate#14703

@rphmeier
Contributor Author

rphmeier commented Aug 3, 2023

This is fairly high-priority - could get reviewed & merged today, hopefully?

@Sophia-Gold Sophia-Gold transferred this issue from paritytech/polkadot Aug 24, 2023
@the-right-joyce the-right-joyce added I2-bug The node fails to follow expected behavior. and removed I3-bug labels Aug 25, 2023
@altonen altonen mentioned this issue Sep 19, 2023
7 tasks

4 participants