This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

Delay reputation updates #7214

Merged 52 commits on Jun 15, 2023

Conversation

@AndreiEres (Contributor) commented May 11, 2023

Problem

Nodes are sending numerous peer reputation changes, which is overloading or blocking the channels.

Hypothesis

We can address this issue by aggregating reputation changes and sending them to the network bridge in one batch every 30 seconds. However, there are certain changes that still require immediate reporting, so we will send those as individual messages right away.
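As a rough illustration of the approach, here is a minimal, standalone sketch of a reputation aggregator (placeholder types; the actual implementation uses the network PeerId and UnifiedReputationChange types and flushes on a timer in the subsystem's main loop):

```rust
use std::collections::HashMap;

// Placeholder types for this sketch; the real code uses the network `PeerId`
// and `UnifiedReputationChange` types.
type PeerId = u64;

#[derive(Clone, Copy)]
enum Rep {
	/// Serious misbehaviour that must be reported right away.
	Malicious(i32),
	/// Minor cost or benefit that is safe to aggregate.
	Minor(i32),
}

impl Rep {
	fn cost_or_benefit(self) -> i32 {
		match self {
			Rep::Malicious(v) | Rep::Minor(v) => v,
		}
	}
}

/// Accumulates reputation changes per peer; the batch is flushed periodically
/// (e.g. every 30 seconds) by the subsystem's main loop.
#[derive(Default)]
struct ReputationAggregator {
	by_peer: HashMap<PeerId, i32>,
}

impl ReputationAggregator {
	/// Record a change. Returns `Some` when it must be reported immediately
	/// instead of being batched.
	fn modify(&mut self, peer: PeerId, rep: Rep) -> Option<(PeerId, i32)> {
		if matches!(rep, Rep::Malicious(_)) {
			return Some((peer, rep.cost_or_benefit()))
		}
		let entry = self.by_peer.entry(peer).or_insert(0);
		*entry = entry.saturating_add(rep.cost_or_benefit());
		None
	}

	/// Take the accumulated batch; called on each timer tick.
	fn flush(&mut self) -> HashMap<PeerId, i32> {
		std::mem::take(&mut self.by_peer)
	}
}
```

In the PR itself the aggregator keeps a send_immediately_if predicate and sends through the subsystem sender (see the review comments below); the sketch returns the immediate report instead to stay self-contained.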

Results

To test the hypothesis, we changed how reputation changes are sent for group A validators and compared the results with groups B, C, and D.

[Image: comparison charts] Group A sends changes in one batch, while others send them in individual messages. Thanos stopped working for some time, so there is no data at the end of the charts.

In the left charts we see a huge difference in the number of messages sent. Group A sends about 200 times fewer messages.

However, the channel load on the right charts is almost unchanged. The peaks in group A are about 15% smaller.

Related

Fixes #7203

@AndreiEres added the following labels on May 24, 2023: B0-silent (Changes should not be mentioned in any release notes), C1-low (PR touches the given topic and has a low impact on builders), T4-parachains_engineering (This PR/Issue is related to Parachains performance, stability, maintenance).
@sandreim (Contributor) left a comment

Looks like it is heading in the right direction. We should run Versi burn-ins after addressing some of the comments, especially the aggregation interval. I am interested in seeing the effects on the tx bridge queue size.

node/network/approval-distribution/src/lib.rs (Outdated)
}

pub fn update(&mut self, peer_id: PeerId, rep: Rep) {
	if matches!(rep, Rep::Malicious(_)) {
Contributor

We should handle this case in the modify_reputation function and keep the aggregator simple.

Contributor Author

What do you mean?
We flush the reputation aggregator outside of the state, so I saw only one way to mark a case where we need to send reputation immediately: set a flag in the aggregator and check it in the outer loop.

	mut reputation_delay: &mut Fuse<futures_timer::Delay>,
	reputation_interval: &std::time::Duration,
) -> bool {
	if reputation_interval.is_zero() || reputation.overflow() {
Contributor

It would be better to use select! to handle message receive and timer futures in the subsystem main loop.
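For illustration, a minimal sketch of what that could look like, using a channel receiver as a stand-in for the subsystem's incoming-message stream and a closure as a stand-in for flushing the batch to the network bridge (both are placeholders, as is the 30-second interval):

```rust
use futures::{channel::mpsc, select, FutureExt, StreamExt};
use futures_timer::Delay;
use std::time::Duration;

const REPUTATION_CHANGE_INTERVAL: Duration = Duration::from_secs(30);

async fn main_loop(mut rx: mpsc::Receiver<String>, mut flush_reputation: impl FnMut()) {
	let mut reputation_delay = Delay::new(REPUTATION_CHANGE_INTERVAL).fuse();

	loop {
		select! {
			// The timer fired: flush the aggregated reputation changes and re-arm it.
			_ = reputation_delay => {
				flush_reputation();
				reputation_delay = Delay::new(REPUTATION_CHANGE_INTERVAL).fuse();
			},
			// A regular subsystem message arrived: handle it as before.
			message = rx.next() => match message {
				Some(_msg) => { /* handle_incoming(...) */ },
				None => break,
			},
		}
	}
}
```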

Contributor Author

Is it OK if the subsystem handles the whole process in one loop, alongside the reputation changes?

return true
}

select! {
Contributor

This looks clunky; you wouldn't need it if you implemented select! in the main loop.

) {
	for (&peer_id, &score) in reputation.by_peer() {
		sender
			.send_message(NetworkBridgeTxMessage::ReportPeer(
Contributor

We could also consider sending the hashmap directly to the network bridge in a new message type ReportPeers, to avoid the overhead of sending many small messages.
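For reference, a rough sketch of such a batch variant (placeholder types; the merged code ended up wrapping both cases in a ReportPeerMessage enum, as the network-bridge excerpt further down shows):

```rust
use std::collections::HashMap;

// Placeholder types for this sketch; the real code uses the network `PeerId`
// and `ReputationChange` types.
type PeerId = u64;
type ReputationChange = i32;

/// Either a single, immediate report or an aggregated batch of reports.
enum ReportPeerMessage {
	Single(PeerId, ReputationChange),
	Batch(HashMap<PeerId, ReputationChange>),
}

/// The bridge message carries either form in a single variant.
enum NetworkBridgeTxMessage {
	ReportPeer(ReportPeerMessage),
	// ... other variants elided
}
```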

Contributor Author

Yep.

				peer_id,
				net_protocol::ReputationChange::new(
					score,
					"Reputation changed during approval-distribution",
Contributor

Logging individual reputation change messages with trace! would be the better alternative. In any case, a better description would be "aggregated reputation change".

@@ -1662,20 +1747,27 @@ impl ApprovalDistribution {

async fn run<Context>(self, ctx: Context) {
	let mut state = State::default();
	let mut reputation = ReputationAggregator::new();
	let reputation_interval = std::time::Duration::from_millis(5);
Contributor

This is very low; likely no aggregation will actually happen, except maybe under load. I was thinking something like 30s to start with. We should do some Versi burn-ins to gauge the effectiveness of the aggregation.

}
}

async fn handle_incoming<Context>(
ctx: &mut Context,
state: &mut State,
reputation: &mut ReputationAggregator,
Contributor

I'd make the ReputationAggregator part of the State to avoid an extra argument to the function.

Contributor Author

Maybe we don't even need an extra abstraction and can just hold the hashmap with aggregated reputations in the State?

@AndreiEres force-pushed the AndreiEres/reputation-delay branch 2 times, most recently from 81092a7 to 1576e2c on May 31, 2023 07:39
// us anything to do with this relay-parent anyway.
let _ = state.per_relay_parent.insert(
// Will run if no futures are immediately ready
default => {
Contributor

This looks conceptually wrong to me, as the loop would get stuck in the ctx.recv() until there is a message coming from the outside. In this specific case, this happens at least every 6s (active leaves), or sooner due to peer messages.

We should remove the default here and write something like the following, so we poll both futures without blocking on a specific one.

message = ctx.recv().fuse() => {
	match message {
		// ...

jaeger::PerLeafSpan::new(activated.span, "approval-distribution");
state.spans.insert(head, approval_distribution_span);
// Will run if no futures are immediately ready
default => {
Contributor

same here

}

/// Adds reputation change to inner state,
/// checks if the change is dangerous, and sends all collected changes in a batch if it is
Contributor

It should only send the malicious rep change, not the entire set.

Contributor Author

Looks reasonable, but then we're going to send hashmaps instead of single messages, so we would need to prepare a hashmap just for a malicious change. Is that OK?

@sandreim (Contributor) Jun 1, 2023

We can keep the current message for the malicious ones and add another one to send them in bulk.

Comment on lines 83 to 88
let current = match self.by_peer.get(&peer_id) {
	Some(v) => *v,
	None => 0,
};
let new_value = current.saturating_add(rep.cost_or_benefit());
self.by_peer.insert(peer_id, new_value);
Contributor

This would look much better and simpler with the entry API. Something like:

let cost = rep.cost_or_benefit();
self.by_peer
	.entry(peer_id)
	.and_modify(|old_rep| *old_rep = old_rep.saturating_add(cost))
	.or_insert(cost);

	rep: UnifiedReputationChange,
) {
	if (self.send_immediately_if)(rep) {
		self.single_send(sender, peer_id, rep).await;
Contributor

Instead of sending a single message here, does it make sense to just do an add and then self.send?

Contributor Author

It was like that originally, but then Andrei asked to change it to the current behavior.
#7214 (comment)

}

fn add(&mut self, peer_id: PeerId, rep: UnifiedReputationChange) {
add_reputation(&mut self.by_peer, peer_id, rep)
Contributor

I think one thing that will be slightly different with this approach of aggregating the reputation changes is here: https://github.com/paritytech/substrate/blob/master/client/network/src/peer_store.rs#L161. However, since our flush interval is relatively small (30s), I think we should be fine.

@AndreiEres force-pushed the AndreiEres/reputation-delay branch 2 times, most recently from dda4119 to 6c9b0c1 on June 7, 2023 08:18
@AndreiEres AndreiEres requested a review from sandreim June 7, 2023 11:55
@vstakhov (Contributor) left a comment

LGTM with some minor nits.

node/subsystem-util/src/reputation.rs (Outdated)
gum::debug!(target: LOG_TARGET, ?peer, ?rep, action = "ReportPeer");
}

metrics.on_report_event();
network_service.report_peer(peer, rep);
},
NetworkBridgeTxMessage::ReportPeer(ReportPeerMessage::Batch(batch)) => {
let reports: Vec<(PeerId, ReputationChange)> = batch
Contributor

Why do we need this intermediate Vec if we can just iterate over the original hash table?
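A small sketch of the suggestion, with placeholder types and a stub for the reporting call, just to show the shape of iterating the map directly:

```rust
use std::collections::HashMap;

// Placeholder types and a stub reporting function for this sketch.
type PeerId = u64;
type ReputationChange = i32;

fn report_peer(_peer: PeerId, _rep: ReputationChange) {
	// In the real code this forwards the report to the network service.
}

fn handle_batch(batch: HashMap<PeerId, ReputationChange>) {
	// Iterate the received map directly; no intermediate Vec is needed.
	for (peer, rep) in batch {
		report_peer(peer, rep);
	}
}
```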

@sandreim (Contributor) left a comment

Nice work @AndreiEres! Let's burn this in on Versi before merging!

}

metrics.on_report_event();
network_service.report_peer(peer, rep);
Member

If this batching helps, then we can likely gain more by pushing the batch even further. network_service by itself also just forwards messages; it might make sense to introduce a batch type there as well, all the way up to the actual worker applying the changes.

Contributor

In our case it helps mostly by putting less pressure on the network bridge tx channel, so subsystems block less often when sending to it. I agree that the batching can be improved further, up to the network service layer. I'd say that is the next step after merging this. @AndreiEres, can you create a ticket for this please?

@AndreiEres (Contributor Author)

bot merge

@paritytech-processbot

Error: Statuses failed for 99beef0

@AndreiEres (Contributor Author)

bot merge

Labels
B0-silent (Changes should not be mentioned in any release notes), C1-low (PR touches the given topic and has a low impact on builders), T4-parachains_engineering (This PR/Issue is related to Parachains performance, stability, maintenance)
Development

Successfully merging this pull request may close these issues.

network-bridge-tx: batch peer reputation changes
5 participants