
Send PeerViewChange with high priority #4755

Merged
merged 9 commits into master from AndreiEres/priority-peer-view-change
Jul 16, 2024

Conversation

AndreiEres
Contributor

@AndreiEres AndreiEres commented Jun 11, 2024

Closes #577

Changed

  • `orchestra` updated to 0.4.0
  • `PeerViewChange` is now sent with high priority, so it should be processed first in a subsystem's message queue (an illustrative sketch follows this list).
  • To count these messages in tests, a tracker was added to TestSender and TestOverseer. It acts more like a smoke test, though.
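For illustration only, here is a minimal, self-contained Rust sketch of the idea (this is not orchestra's actual API; `Event` and `ToyReceiver` are hypothetical names): the sender routes `PeerViewChange` onto a separate priority queue, and the receiver drains that queue before the normal one.

```rust
use std::collections::VecDeque;

// Hypothetical message type standing in for NetworkBridgeEvent.
enum Event {
    PeerViewChange(u32),
    PeerMessage(u32),
}

// Toy receiver with two queues: the priority queue is always drained first.
#[derive(Default)]
struct ToyReceiver {
    priority: VecDeque<Event>,
    normal: VecDeque<Event>,
}

impl ToyReceiver {
    fn send(&mut self, event: Event) {
        // PeerViewChange goes onto the priority queue; everything else is normal.
        if matches!(event, Event::PeerViewChange(..)) {
            self.priority.push_back(event);
        } else {
            self.normal.push_back(event);
        }
    }

    fn recv(&mut self) -> Option<Event> {
        self.priority.pop_front().or_else(|| self.normal.pop_front())
    }
}

fn main() {
    let mut rx = ToyReceiver::default();
    rx.send(Event::PeerMessage(1));
    rx.send(Event::PeerViewChange(7));
    // The view change jumps the queue even though it was sent second.
    assert!(matches!(rx.recv(), Some(Event::PeerViewChange(7))));
}
```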

Testing on Versi

The changes were tested on Versi with two objectives:

  1. Make sure the node functionality does not change.
  2. See how the changes affect performance.

Test setup:

  • 2.5 hours for each case
  • 100 validators
  • 50 parachains
  • validatorsPerCore = 2
  • neededApprovals = 100
  • nDelayTranches = 89
  • relayVrfModuloSamples = 50

During the test period, all nodes ran without any crashes, which satisfies the first objective.

To estimate the change in performance we used ToF (message time-of-flight) charts. The graphs show that the spikes at the top, which were present before, are gone, which supports our hypothesis.

Normalized charts with ToF

![image](https://github.com/user-attachments/assets/0d49d0db-8302-4a8c-a557-501856805ff5)
[Before](https://grafana.teleport.parity.io/goto/ZoR53ClSg?orgId=1)

![image](https://github.com/user-attachments/assets/9cc73784-7e45-49d9-8212-152373c05880)
[After](https://grafana.teleport.parity.io/goto/6ux5qC_IR?orgId=1)

Conclusion

The prioritization of subsystem messages reduces the ToF of the networking subsystem, which speeds up the propagation of gossip messages.

@AndreiEres AndreiEres requested a review from alexggh June 11, 2024 09:08
@eskimor
Member

eskimor commented Jun 11, 2024

This definitely needs to be tested under high load.

  1. It would be good to measure actual improvements under heavy load, compared to master. E.g. ideally have some reproducible finality lag or even crashes that go away with this PR.
  2. Only under high load will this feature make a difference, so any bugs will also only be found in a high-load scenario.

To get heavy load you can spawn a good number of parachains and use malus nodes to create a dispute storm.

@AndreiEres AndreiEres added T0-node This PR/Issue is related to the topic “node”. T8-polkadot This PR/Issue is related to/affects the Polkadot network. labels Jun 11, 2024
@alexggh
Contributor

alexggh commented Jun 11, 2024

> This definitely needs to be tested under high load.
>
> 1. It would be good to measure actual improvements under heavy load, compared to master. E.g. ideally have some reproducible finality lag or even crashes that go away with this PR.

This PR should help in situations where we have a high backlog of unprocessed messages in approval-distribution: we will gossip messages to a peer as soon as we receive its PeerViewChange, instead of only once we have caught up with processing. The benefit is that peers will reach the conclusion that enough assignments have been triggered sooner, so they won't trigger their own assignments.

We can probably simulate such a scenario with the approval-voting subsystem-benchmark and measure how fast our node gossips the assignments it received to its peers.

@sandreim
Contributor

> This definitely needs to be tested under high load.
>
> 1. It would be good to measure actual improvements under heavy load, compared to master. E.g. ideally have some reproducible finality lag or even crashes that go away with this PR.

The main benefit of this change is that we keep gossip protocols in better sync with what other peers are working on. I expect this to lead to improved gossip propagation times -> lower approval-checking lag and less needless work (like @alexggh said above, we won't trigger our own assignments while sufficient other assignments sit in our queue).

> 2. Only under high load will this feature make a difference, so any bugs will also only be found in a high-load scenario.

I think this is mandatory. AFAIR we had some trouble in the past when switching to async channel, maybe it's a good time to try it and maybe get some speedup there as well.

> To get heavy load you can spawn a good number of parachains and use malus nodes to create a dispute storm.

Yeah, like good old times on Versi :)

@@ -322,6 +339,45 @@ pub fn make_subsystem_context<M, S>(
make_buffered_subsystem_context(spawner, 0)
}

/// Message counter over subsystems.
#[derive(Default, Clone)]
pub struct MessageCounter(Arc<Mutex<MessageCounterInner>>);
Contributor

Since all of the inner stuff is counters, we can drop the mutex and use atomics.
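A minimal sketch of what this suggestion could look like, assuming the inner struct only holds plain counters (the field and method names here are hypothetical, not the ones in the PR): the Mutex is dropped in favour of atomics.

```rust
use std::sync::{
    atomic::{AtomicUsize, Ordering},
    Arc,
};

/// Lock-free variant of the test-helper counter: atomics instead of a Mutex.
#[derive(Default, Clone)]
pub struct MessageCounter(Arc<MessageCounterInner>);

#[derive(Default)]
struct MessageCounterInner {
    total: AtomicUsize,
    with_high_priority: AtomicUsize,
}

impl MessageCounter {
    pub fn increment(&self, is_high_priority: bool) {
        // Relaxed ordering is enough for test-only counters with no
        // cross-field invariants to preserve.
        self.0.total.fetch_add(1, Ordering::Relaxed);
        if is_high_priority {
            self.0.with_high_priority.fetch_add(1, Ordering::Relaxed);
        }
    }

    pub fn total(&self) -> usize {
        self.0.total.load(Ordering::Relaxed)
    }

    pub fn with_high_priority(&self) -> usize {
        self.0.with_high_priority.load(Ordering::Relaxed)
    }
}
```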

macro_rules! send_message {
($event:expr, $message:ident) => {
if let Ok(event) = $event.focus() {
let has_high_priority = matches!(event, NetworkBridgeEvent::PeerViewChange(..));
Contributor

Thought about this a bit more: one side effect of this is that we change the order of messages. For example, if the system is loaded, I can imagine a situation where the bridge sends the PeerConnected on the normal channel and then the PeerViewChange on the priority channel, so the PeerViewChange will arrive before the PeerConnected and will be discarded.

I don't think it is a frequent problem, but it is nonetheless something this might cause, so I wonder if we need to make PeerConnected high priority as well.

Contributor Author

Was it possible to receive PeerViewChange before PeerConnected because of some network problems?
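A minimal sketch of the reordering concern raised in this thread, with hypothetical types: a subsystem that only stores views for peers it already knows about will silently drop a PeerViewChange that overtakes its PeerConnected.

```rust
use std::collections::{HashMap, HashSet};

// Stand-in for the real network PeerId.
type PeerId = u64;

#[derive(Default)]
struct PeerTracker {
    // View (set of relay-parent "hashes") per connected peer.
    peers: HashMap<PeerId, HashSet<u64>>,
}

impl PeerTracker {
    fn on_peer_connected(&mut self, peer: PeerId) {
        self.peers.entry(peer).or_default();
    }

    fn on_peer_view_change(&mut self, peer: PeerId, view: HashSet<u64>) {
        // Unknown peer: the update is silently discarded.
        if let Some(stored) = self.peers.get_mut(&peer) {
            *stored = view;
        }
    }
}

fn main() {
    let mut tracker = PeerTracker::default();
    // The view change arrives before the connect event: the view is lost.
    tracker.on_peer_view_change(42, HashSet::from([1]));
    tracker.on_peer_connected(42);
    assert!(tracker.peers[&42].is_empty());
}
```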

@AndreiEres AndreiEres force-pushed the AndreiEres/priority-peer-view-change branch 3 times, most recently from f7508b9 to f1ebc26 on June 28, 2024 14:34
@AndreiEres AndreiEres force-pushed the AndreiEres/priority-peer-view-change branch from f1ebc26 to 33b293f on June 28, 2024 14:35
@AndreiEres
Contributor Author

Tested on Versi, results posted in the first message.

@paritytech-cicd-pr

The CI pipeline was cancelled due to the failure of one of the required jobs.
Job name: test-linux-stable 3/3
Logs: https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/6678198

Contributor

@alexggh alexggh left a comment

Overall looks good to me. I left you some comments; once those get addressed we should be good to go.

polkadot/node/network/bridge/src/rx/mod.rs (outdated)
Contributor

@sandreim sandreim left a comment

LGTM! Nice work @AndreiEres, left just a few more nits.

title: Send PeerViewChange with high priority

doc:
- audience: Node Operator
Contributor

I think Node Developers should be the main target audience. Node operators would likely want to hear if and how this reduces the CPU utilization of the node.

doc:
- audience: Node Operator
description: |
- orchestra updated to 0.4.0
Contributor

This can be more specific and mention why we are upgrading it; in this case it should be the support for subsystem message delivery prioritization.

@AndreiEres AndreiEres added this pull request to the merge queue Jul 16, 2024
Merged via the queue into master with commit 975e04b Jul 16, 2024
153 of 158 checks passed
@AndreiEres AndreiEres deleted the AndreiEres/priority-peer-view-change branch July 16, 2024 18:56
ordian added a commit that referenced this pull request Jul 17, 2024
* master:
  add elastic scaling MVP guide (#4663)
  Send PeerViewChange with high priority (#4755)
  [ci] Update forklift in CI image (#5032)
  Adjust base value for statement-distribution regression tests (#5028)
  [pallet_contracts] Add support for transient storage in contracts host functions (#4566)
  [1 / 5] Optimize logic for gossiping assignments (#4848)
  Remove `pallet-getter` usage from pallet-session (#4972)
  command-action: added scoped permissions to the github tokens (#5016)
  net/litep2p: Propagate ValuePut events to the network backend (#5018)
  rpc: add back rpc logger (#4952)
  Updated substrate-relay version for tests (#5017)
  Remove most all usage of `sp-std` (#5010)
  Use sp_runtime::traits::BadOrigin (#5011)
jpserrat pushed a commit to jpserrat/polkadot-sdk that referenced this pull request Jul 18, 2024
ordian added a commit that referenced this pull request Jul 18, 2024
TarekkMA pushed a commit to moonbeam-foundation/polkadot-sdk that referenced this pull request Aug 2, 2024
ordian added a commit that referenced this pull request Aug 6, 2024
Labels
T0-node This PR/Issue is related to the topic “node”. T8-polkadot This PR/Issue is related to/affects the Polkadot network.

Successfully merging this pull request may close these issues.

Process PeerViewChange messages with priority
5 participants