
Bron-Kerbosch attestation aggregation #4507

Open
wants to merge 41 commits into unstable from bron_kerbosch_attestation_aggregation

Conversation

GeemoCandama
Contributor

Issue Addressed

This branch will address #3733

Proposed Changes

Incorporate the Bron-Kerbosch algorithm that Satalia wrote into the attestation aggregation flow. More info in #3733

Additional Info

@ankurdubey521 and I will be working on this.

@paulhauner paulhauner added the work-in-progress (PR is a work-in-progress) label Jul 26, 2023
@GeemoCandama GeemoCandama force-pushed the bron_kerbosch_attestation_aggregation branch from 5329233 to f39becd Compare July 28, 2023 19:22
@GeemoCandama
Contributor Author

I think it should be ready for a first pass.

Overview:

  • Deleted the import of aggregates from the naive_aggregation_pool, since the unaggregated attestations should already be in the op_pool.
  • During unaggregated attestation processing, add the attestations to the op_pool.
  • Made CompactAttestationData Clone (curious if this is the best solution here).
  • Altered the fields of AttestationDataMap to a HashMap for aggregated attestations and a HashMap for unaggregated attestations.
  • Altered the methods on AttestationDataMap accordingly.
  • Removed greedy aggregation on insertion into the op_pool.
  • Ported the Bron-Kerbosch implementation from Satalia to Lighthouse (see the sketch below).
  • Added a get_clique_aggregate_attestations_for_epoch -> Vec<(&CompactAttestationData, CompactIndexedAttestation<T>)> method (this is where most of the changes are; I'm curious whether it would be better to return an Iterator).
  • Used the AttMaxCover of the output of the above as input to max_cover.

These changes also left a couple of previously used functions unused. Should I go ahead and delete them?
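For reference, here is a minimal, generic sketch of textbook Bron-Kerbosch without pivoting (an illustration only, not the ported Satalia code). For aggregation, the vertices would be the attestations sharing one CompactAttestationData, and two vertices are "neighbors" when their aggregation bits are disjoint, so each maximal clique can be unioned into a single aggregate before max_cover runs:

use std::collections::HashSet;

// Textbook Bron-Kerbosch without pivoting. `neighbors(u, v)` says whether two
// vertices are connected; for attestation aggregation, "connected" would mean
// the two attestations' aggregation bits are disjoint.
fn bron_kerbosch(
    r: &mut Vec<usize>,
    mut p: HashSet<usize>,
    mut x: HashSet<usize>,
    neighbors: &impl Fn(usize, usize) -> bool,
    cliques: &mut Vec<Vec<usize>>,
) {
    if p.is_empty() && x.is_empty() {
        // R cannot be extended any further, so it is a maximal clique.
        cliques.push(r.clone());
        return;
    }
    for v in p.clone() {
        r.push(v);
        let p_v = p.iter().copied().filter(|&u| u != v && neighbors(u, v)).collect();
        let x_v = x.iter().copied().filter(|&u| u != v && neighbors(u, v)).collect();
        bron_kerbosch(r, p_v, x_v, neighbors, cliques);
        r.pop();
        p.remove(&v);
        x.insert(v);
    }
}

// Enumerate all maximal cliques over vertices 0..n.
fn maximal_cliques(n: usize, neighbors: impl Fn(usize, usize) -> bool) -> Vec<Vec<usize>> {
    let mut cliques = Vec::new();
    bron_kerbosch(&mut Vec::new(), (0..n).collect(), HashSet::new(), &neighbors, &mut cliques);
    cliques
}

Calling maximal_cliques with the per-data attestation count and a disjointness check then yields the candidate aggregates for one attestation data.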

@GeemoCandama
Contributor Author

GeemoCandama commented Aug 16, 2023

I think the general ideas are there; however, there are a few tests that stall forever that I'm working on:

test store_tests::delete_blocks_and_states has been running for over 60 seconds
test store_tests::prune_single_block_long_skip has been running for over 60 seconds
test store_tests::shuffling_compatible_simple_fork has been running for over 60 seconds
test store_tests::block_production_different_shuffling_early has been running for over 60 seconds
test store_tests::block_production_different_shuffling_long has been running for over 60 seconds

I also tried what I thought would be an optimization: going through the cliques and filtering out those that are a subset of another clique before aggregating the attestations. That didn't seem to improve the situation, so I reverted it for now.

EDIT: This is no longer an issue

@GeemoCandama GeemoCandama force-pushed the bron_kerbosch_attestation_aggregation branch from b0da5a1 to af26f34 Compare September 28, 2023 15:31
@GeemoCandama
Contributor Author

attestation_packing
This is from the block packing dashboard. I collect the parallel iterator because the validity_fn is not Sync. I think some of this could be worked out. I tried to return an iterator from the function, but the validity_fn closure is captured and can't be returned.

@GeemoCandama GeemoCandama force-pushed the bron_kerbosch_attestation_aggregation branch from b9c8fae to d2bd6af Compare October 18, 2023 05:16
@GeemoCandama
Contributor Author

Before:
Screenshot from 2023-10-19 01-38-38
After:
Screenshot from 2023-10-19 01-37-54
It's a slight improvement on average, but the "after" run has slightly higher peaks.

@michaelsproul michaelsproul changed the title [WIP] Bron-Kerbosch attestation aggregation Bron-Kerbosch attestation aggregation Nov 3, 2023
@michaelsproul michaelsproul marked this pull request as ready for review November 3, 2023 02:19
@michaelsproul michaelsproul added the ready-for-review (The code is ready for review), v4.6.0 (ETA Q1 2024) and under-review (A reviewer has only partially completed a review) labels, and removed the work-in-progress and ready-for-review labels Nov 3, 2023
@michaelsproul
Member

michaelsproul commented Nov 3, 2023

I was testing the performance with --subscribe-all-subnets --import-all-attestations and noticed that it starts to deteriorate after some time. I think it's because we're iterating and processing ~900k unaggregated attestations every slot once the pool fills up.

attestation_packing

This is even with max-aggregates-per-data set to 1!

I realised that the validity_filter doesn't need to run for every unaggregated attestation, as the filter we actually use only depends on the checkpoint & the attestation data. I made a hacky patch for this here, which seems to have improved the performance. I'll clean that up tomorrow or early next week (or hand it over to you if you're keen). I think a better solution would be to make the validity filter take a pair of Epoch (or CheckpointKey) and &AttestationData.
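To illustrate the shape I have in mind (hypothetical names and types, not the actual op pool code): evaluate the filter once per (checkpoint, attestation data) pair rather than once per unaggregated attestation, e.g.:

use std::collections::HashMap;

// Hypothetical sketch: `groups` maps attestation data to its unaggregated
// attestations, and `validity_filter` only looks at the checkpoint key and
// the data, so it runs once per data key instead of once per attestation.
fn filter_by_data<'a, K, D: std::hash::Hash + Eq, A>(
    groups: &'a HashMap<D, Vec<A>>,
    checkpoint_key: &K,
    mut validity_filter: impl FnMut(&K, &D) -> bool,
) -> Vec<(&'a D, &'a [A])> {
    groups
        .iter()
        .filter(|&(data, _)| validity_filter(checkpoint_key, data))
        .map(|(data, atts)| (data, atts.as_slice()))
        .collect()
}

With ~900k unaggregated attestations but far fewer distinct attestation data, this should remove the vast majority of the filter calls.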

I'll post an updated graph once it's had a chance to warm up for a bit. The time seems to be more like the 30ms we see for regular nodes 👌

@michaelsproul
Member

Looks a lot better 🎉

attestation_packing_fixed

@michaelsproul
Member

I think this is ready to go. I've been running it on one of our Holesky beacon nodes for the last few days without issue. If anything it's a little bit faster than unstable.

Here's the median attestation packing time for this PR (green) vs unstable (purple):

attestation_packing_time_median

Here's the 95th percentile:

attestation_packing_time_95pc

Here's total block production time (median over 12h):

total_time_median

95th percentile over 12h:

total_time_95pc

Investigating that spike with a shorter interval shows it's a spike to 2s+

total_time_1h_95pc

This doesn't really correspond to a spike in attestation packing time, so I suspect it's another issue (lord knows we have state issues while packing blocks). The version of unstable running on the other BNs is also running without the changes from #4794.

I'll let @paulhauner do a once-over of the Rayon stuff (because we've been bitten before), and then we can merge.

@michaelsproul
Member

Logs from the spike also seem to show that it was actually only a spike to 1.2s, but even with histogram buckets I don't quite see how that makes sense:

Nov 08 06:08:01.222 DEBG Processed HTTP API request method: GET, path: /eth/v2/validator/blocks/293440, status: 200 OK, elapsed: 1.192266986s

@paulhauner paulhauner self-requested a review November 14, 2023 00:32
Member

@paulhauner paulhauner left a comment

Heyo, looking very impressive!

I had a look at just the Rayon code and I think there's a potential for a deadlock in there (see my comment).

.map(|(data, aggregates)| {
    let aggregates: Vec<&CompactIndexedAttestation<T>> = aggregates
        .iter()
        .filter(|_| validity_filter(checkpoint_key, data))
Member

I think we're at risk of the deadlock described in this HackMD (see also #1562).

The validity_filter function can try to access the fork choice read lock:

let fork_choice_lock = self.canonical_head.fork_choice_read_lock();

That means this function would behave the same as the rest_api::validator::publish_attestations function I describe in the HackMD.

In terms of a workaround, I'm not fully across this PR but I feel like the parallel functionality is most important for the second map fn (the one which calls bron_kerbosch). So perhaps the filter and first map could happen in a good ol' fashioned serial iter?

Member

Hmm, good point. I'm trying to understand how we can deadlock if we are only reading and never holding a write lock while spawning a Rayon task. I think we are safe because we only have reads (item 2 from the HackMD) and not writes (item 1).

Reading the Rayon docs a bit more, I think how the deadlock happens is this:

  1. Thread 1 obtains the write lock and then spawns a Rayon job on the Rayon thread pool and blocks waiting for it to complete.
  2. Concurrently, another thread spawns a bunch of jobs on the Rayon thread pool which try to grab the read lock.
  3. Rayon's pool "schedules" (="steals work such that") all of the reading jobs begin executing but not the job spawned by thread 1 which is required to release the lock. Unlike async tasks in Tokio-land, there is no implicit yielding from Rayon jobs once they start executing.
  4. Therefore we get a deadlock: all the threads in the pool are blocked waiting for a job that can only execute in the pool, which will never execute.

Without holding an exclusive lock/mutex while spawning I think we are safe. If all the jobs in the pool are just grabbing read locks and there is no other thread holding a write lock while it waits for a Rayon job to complete, then we are OK. Even if we have other (non-Rayon) threads grabbing the write lock, this is OK too, as they will be run concurrently by the OS, and the write lock will eventually be released, allowing the Rayon threads that are reading to make progress.
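Spelled out as a toy example (my own sketch, not Lighthouse code), the claim is that this arrangement cannot wedge, because the writer never waits on the Rayon pool while holding the lock:

use std::sync::{Arc, RwLock};
use std::thread;
use std::time::Duration;

use rayon::prelude::*;

fn main() {
    let fork_choice = Arc::new(RwLock::new(0u64));

    // A plain OS thread takes the write lock repeatedly, but never blocks on
    // Rayon while holding it, so the lock is always released promptly.
    let fc = Arc::clone(&fork_choice);
    let writer = thread::spawn(move || {
        for _ in 0..100 {
            *fc.write().unwrap() += 1;
            thread::sleep(Duration::from_micros(50));
        }
    });

    // Rayon jobs only ever take the read lock, so they always make progress.
    let total: u64 = (0..10_000u64)
        .into_par_iter()
        .map(|_| *fork_choice.read().unwrap())
        .sum();

    writer.join().unwrap();
    println!("sum of observed values: {total}");
}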

I went looking for places where we are already locking an RwLock/Mutex inside a Rayon job and actually the current version of the op pool uses rayon::join on two lazy iterators that evaluate the same validity_filter (and therefore obtain a read lock on fork choice from within the Rayon pool):

let (prev_cover, curr_cover) = rayon::join(
    move || {
        let _timer = metrics::start_timer(&metrics::ATTESTATION_PREV_EPOCH_PACKING_TIME);
        // If we're in the genesis epoch, just use the current epoch attestations.
        if prev_epoch_key == curr_epoch_key {
            vec![]
        } else {
            maximum_cover(prev_epoch_att, prev_epoch_limit, "prev_epoch_attestations")
        }
    },
    move || {
        let _timer = metrics::start_timer(&metrics::ATTESTATION_CURR_EPOCH_PACKING_TIME);
        maximum_cover(
            curr_epoch_att,
            T::MaxAttestations::to_usize(),
            "curr_epoch_attestations",
        )
    },
);

i.e. we are already doing this and it is not deadlocking

Additionally in Milhouse, we are recursively obtaining a write lock inside Rayon jobs here:

https://github.com/sigp/milhouse/blob/6c82029bcbc656a3cd423d403d7551974818d45d/src/tree.rs#L430-L433

Mac and I played around with that code trying to use upgradable locks and quickly hit deadlocks, which I think implies the current pattern is safe (see our attempts in sigp/milhouse#25, sigp/milhouse#29).

TL;DR: I think that Rayon is only dangerous if you hold a lock while spawning jobs into the pool.

This was what I previously thought, but I'd never understood why this was the case and thought it had something to do with thread-local storage, context switches and mutexes. I find the explanation above quite satisfactory and hope you do too!

Also, if we were to remove the par_iter we need a collect, as @GeemoCandama showed in 376c408. If you agree with my analysis I reckon we revert that commit 🙏

Thanks for reviewing!

Member

@paulhauner paulhauner Nov 14, 2023

Without holding an exclusive lock/mutex while spawning I think we are safe.

I think we hold an exclusive lock whilst spawning here:

fork_choice
    .on_block(current_slot, block, block_root, &state)
    .map_err(|e| BlockError::BeaconChainError(e.into()))?;

We take a write-lock on fork choice, then we call on_block which might read a state from the DB which might involve rayon-parallelised tree hashing.

So the deadlock would be (sketched concretely after this list):

  1. We take the fork choice write lock in on_block but don't yet trigger tree hashing.
  2. We run the par_iter in this PR, which fills all the rayon threads with functions waiting on the fork choice read lock.
  3. on_block tries to start tree hashing but it can't start because all the rayon threads are busy.
  4. We are deadlocked because we need on_block to finish so we can drop the fork choice write-lock.
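To make the hazard concrete, here's a deliberately tiny, self-contained sketch of that pattern (made-up state and numbers, not Lighthouse code). With a small Rayon pool it will usually hang, because the "import" thread holds the write lock while waiting on the pool, and the pool is full of jobs waiting on that same lock:

use std::sync::{Arc, RwLock};
use std::thread;
use std::time::Duration;

use rayon::prelude::*;

fn main() {
    // A deliberately small pool, analogous to a machine with few cores.
    rayon::ThreadPoolBuilder::new()
        .num_threads(2)
        .build_global()
        .unwrap();

    let fork_choice = Arc::new(RwLock::new(0u64));

    // "Block import": take the write lock, then block on Rayon work
    // (standing in for tree hashing) before releasing it.
    let fc = Arc::clone(&fork_choice);
    let import = thread::spawn(move || {
        let mut guard = fc.write().unwrap();
        // Give "block production" time to fill the pool with read-lock waiters.
        thread::sleep(Duration::from_millis(100));
        *guard += (0..1024u64).into_par_iter().sum::<u64>();
    });

    // "Block production": Rayon jobs that each need the read lock
    // (standing in for validity_filter reading fork choice).
    let fc = Arc::clone(&fork_choice);
    let produce = thread::spawn(move || {
        (0..1024u64).into_par_iter().for_each(|_| {
            let _read = fc.read().unwrap();
        });
    });

    import.join().unwrap();
    produce.join().unwrap();
    println!("finished without deadlocking (timing dependent)");
}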

Member

@michaelsproul michaelsproul Nov 14, 2023

Oh bugger, I think that could deadlock even with our current code then.

I think we need to choose between:

  1. Never hold a lock while spawning Rayon tasks, or
  2. Never obtain a lock in a Rayon task that could be held over a spawn

I have a feeling that (1) is better in some way, but need to sleep on it.

Member

Actually does on_block load states? I can't see it at a quick glance. I also don't think loading a state does any tree hashing (although it will in tree-states land, so definitely something to keep an eye on).

Member

Ah yeah, here:

let (_, state) = self
    .store
    .get_advanced_hot_state(
        self.justified_checkpoint.root,
        max_slot,
        justified_block.state_root(),
    )

Member

If I get some time today or tomorrow I'll try writing a test with hiatus that deadlocks on unstable.

Member

After a lot of mucking around I got block production & tree hashing to deadlock on unstable, sort of. The branch is here: michaelsproul@5e253de

It was really hard to get all the bits to line up, so there are 7 separate hiatus breakpoints. I also had to cheat and just force tree hashing in import_block because I ran out of patience to get the tree hashing and state load in on_block to trigger. The main problem is that the attestations were triggering the justified checkpoint update before I could trigger it with on_block. I think using the mainnet spec would make this a bit easier as we wouldn't have to justify on the epoch boundary (as we do with minimal).

I think this deadlock (on unstable) hasn't been observed in practice for several reasons:

  1. It requires block production and block import to be running concurrently. This doesn't happen particularly often, but isn't that unlikely.
  2. It requires the balances cache in fork choice to miss. This almost never happens in recent versions of Lighthouse.
  3. It requires a machine with ~2 cores, because we only spawn two op pool tasks using rayon::join. If there are more than 2 Rayon threads in the pool then the tree hashing can make progress and drop the fork choice lock, which unblocks the op pool tasks.

In my example you can see the deadlock happen with:

RAYON_NUM_THREADS=2 FORK_NAME=capella cargo test --release --features fork_from_env -- deadlock_block_production --nocapture

If you bump the threadpool size to 3 then the deadlock doesn't happen (due to condition 3).

RAYON_NUM_THREADS=3 FORK_NAME=capella cargo test --release --features fork_from_env -- deadlock_block_production --nocapture

I haven't tried deadlocking Geemo's branch yet, but intuitively we make the deadlock more likely by spawning more threads for the attestation data (condition 3 no longer applies). Still, we are protected by the unlikeliness of condition 1 & condition 2 occurring simultaneously.

I need to think about it more.

@michaelsproul
Member

Hey @GeemoCandama, I think to play it safe let's remove all the Rayon calls and assess the performance impact. After going deep down the Rayon rabbit hole I don't think there's a completely safe way for us to use Rayon in the op pool at the moment (see #4952).

I think we should revisit parallelism in the op pool at a later date, maybe using a different dedicated thread pool, or Tokio. Hopefully #4925 buys us a little bit of wiggle room by reducing block production times overall.
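For what it's worth, one minimal non-Rayon shape (illustrative only, with dummy numbers standing in for the packing work; a real solution would likely use a dedicated pool rather than spawning per block) is plain scoped threads, which can't interact with Rayon's worker pool at all:

use std::thread;

// Illustrative only (not real Lighthouse code): run the two per-epoch packing
// jobs on plain scoped threads instead of rayon::join, so the op pool can
// never be starved by, or starve, Rayon's global worker pool.
fn main() {
    let prev_epoch_scores = vec![3u64, 1, 4];
    let curr_epoch_scores = vec![1u64, 5, 9];

    // Stand-ins for the two maximum_cover calls.
    let pack_prev = || prev_epoch_scores.iter().sum::<u64>();
    let pack_curr = || curr_epoch_scores.iter().sum::<u64>();

    let (prev, curr) = thread::scope(|s| {
        let prev_handle = s.spawn(pack_prev);
        // Reuse the calling thread for the current epoch's half.
        let curr = pack_curr();
        (prev_handle.join().expect("prev-epoch packing panicked"), curr)
    });

    println!("prev: {prev}, curr: {curr}");
}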

@michaelsproul michaelsproul added the v5.0.0 (Q1 2024) label and removed the v4.6.0 (ETA Q1 2024) label Dec 15, 2023
@GeemoCandama
Contributor Author

Screenshot from 2024-01-23 00-32-18

@michaelsproul
Member

I did some benchmarking with --subscribe-all-subnets and attestation packing was taking 100-250ms regularly. I think we need the thread pool. I will add a non-Rayon thread pool when I have some time. I'll bump this from the v5.0.0 release, which needs to happen next week.

Labels: optimization (Something to make Lighthouse run more efficiently), under-review (A reviewer has only partially completed a review)