beefy: Tolerate pruned state on runtime API call #5197

Merged
merged 7 commits into paritytech:master on Aug 5, 2024

Conversation

@skunert (Contributor) commented on Jul 31, 2024:

While working on #5129 I noticed that after warp sync, nodes would print:

2024-07-29 17:59:23.898 ERROR ⋮beefy: 🥩 Error: ConsensusReset. Restarting voter.    

After some debugging I found that we enter the following loop:

  1. Wait for the beefy pallet to be available: the pallet is detected as available directly after warp sync, since we are at the tip.
  2. Wait for the headers from the tip down to beefy genesis to be available: during this time we don't process finality notifications, since we later want to inspect all of these headers for authority set changes.
  3. Gap sync finishes, so the route to beefy genesis is available.
  4. The worker starts acting and tries to fetch the beefy genesis block. This fails, since we are acting on old finality notifications whose state has already been pruned (see the sketch below).
  5. The whole beefy subsystem is restarted, loading the state from the db again and iterating over a lot of headers.

This was already happening before #5129.
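The change in this PR (shown in the diff further down) makes the worker tolerate that runtime API error instead of restarting. A minimal sketch of the pattern, assuming a simplified error type and passing the fetched result in directly rather than going through the real runtime API client:

```rust
use log::debug;

const LOG_TARGET: &str = "beefy";

/// Simplified stand-in for the worker error that triggers a voter restart.
#[derive(Debug)]
enum WorkerError {
    ConsensusReset,
}

/// Sketch of the tolerant genesis check. `fetched` stands in for the result of
/// the `beefy_genesis` runtime call, which returns `Err(..)` when the state at
/// the queried block has already been pruned.
fn check_beefy_genesis(
    fetched: Result<Option<u64>, String>,
    expected: Option<u64>,
) -> Result<(), WorkerError> {
    match fetched {
        // The on-chain genesis matches what we have persisted: keep going.
        Ok(genesis) if genesis == expected => Ok(()),
        // The on-chain genesis really differs: a consensus reset is required.
        Ok(_) => Err(WorkerError::ConsensusReset),
        // The state was already pruned, e.g. old finality notifications piled
        // up after warp sync: log at debug level and skip this notification
        // instead of restarting the whole voter.
        Err(api_error) => {
            debug!(target: LOG_TARGET, "🥩 Unable to check beefy genesis: {}", api_error);
            Ok(())
        },
    }
}
```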

@skunert added the T0-node (This PR/Issue is related to the topic “node”.) label on Jul 31, 2024
@skunert requested review from serban300, acatangiu and a team on July 31, 2024 at 10:22
@@ -451,7 +451,8 @@ where
state.set_best_grandpa(best_grandpa.clone());
// Overwrite persisted data with newly provided `min_block_delta`.
state.set_min_block_delta(min_block_delta);
- debug!(target: LOG_TARGET, "🥩 Loading BEEFY voter state from db: {:?}.", state);
+ debug!(target: LOG_TARGET, "🥩 Loading BEEFY voter state from db.");
+ trace!(target: LOG_TARGET, "🥩 Loaded state: {:?}.", state);
skunert (Contributor, Author) commented:

This is unrelated, but the state is quite big and causes log files to grow to multiple gigabytes.

@lexnv (Contributor) left a comment:

LGTM!

Now that the beefy worker is more resilient to runtime call errors and does not restart as often, I wonder whether this has also made things better from the memory perspective 🤔

I've seen some periodic spikes (up to 8 MiB; nothing concerning, just something to keep in mind) with the beefy unbounded channels:
(Screenshot from 2024-07-31 14:44 showing the memory spikes of the beefy unbounded channels.)

Err(api_error) => {
// This can happen in case the block was already pruned.
// Mostly after warp sync when finality notifications are piled up.
debug!(target: LOG_TARGET, "🥩 Unable to check beefy genesis: {}", api_error);
@serban300 (Contributor) commented on Jul 31, 2024:

I'm not sure this is safe, since this way we might miss consensus resets, and in that case further worker processing might be incorrect. Thinking about it.

Longer term we plan to add a header log to handle this kind of situation. Something like this: paritytech/substrate#14765

Contributor replied:

Another option would be to store the beefy_genesis in UnpinnedFinalityNotification. We could retrieve it inside the transformer, when it should be available. Would that fix the issue?

skunert (Contributor, Author) replied:


> Another option would be to store the beefy_genesis in UnpinnedFinalityNotification. We could retrieve it inside the transformer, when it should be available. Would that fix the issue?

This could work 👍 .

Regarding missing the ConsensusReset: eventually we will catch up to the latest blocks and find that the beefy_genesis is indeed different from what we have stored. In that case we would still trigger the consensus reset, just later, once we reach the tip. Is that acceptable? What are the risks here? If it is not acceptable, we could add the beefy genesis to the UnpinnedFinalityNotification as you proposed.
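For illustration only, the alternative being discussed might look roughly like the following; the field, its type, and the comments are assumptions, not the actual UnpinnedFinalityNotification definition:

```rust
/// Hypothetical shape of the alternative: carry the beefy genesis inside the
/// notification so the worker never has to query potentially pruned state.
struct UnpinnedFinalityNotification<Hash> {
    /// Hash of the finalized block (placeholder for the existing fields).
    hash: Hash,
    /// Beefy genesis as seen by the runtime at the time of finalization,
    /// queried inside the transformer while the state is still available.
    beefy_genesis: Option<u64>,
}
```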

Contributor replied:


Hmmm, I looked a bit at the code. I was thinking there might be problems with the authority set or the payload, but it looks like that can't happen. I think it's ok, but I would also wait for @acatangiu's input on this.

@acatangiu (Contributor) commented on Aug 1, 2024:


The change itself is correct in what it does, but we might still have undiscovered corner cases that this will surface. I suggest testing it on Rococo, where we have actually reset consensus, and seeing how it behaves.

skunert (Contributor, Author) replied:


Ah okay, I was not aware that this was done on Rococo; I will perform some tests there before merging.

@@ -32,7 +32,7 @@ use crate::{
metrics::register_metrics,
};
use futures::{stream::Fuse, FutureExt, StreamExt};
- use log::{debug, error, info, warn};
+ use log::{debug, error, info, log_enabled, trace, warn, Level};
Contributor commented:


nit:

are log_enabled and Level used?

skunert (Contributor, Author) replied:


Good catch, will remove

@paritytech-cicd-pr commented:

The CI pipeline was cancelled due to the failure of one of the required jobs.
Job name: cargo-clippy
Logs: https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/6880195

@skunert enabled auto-merge on August 5, 2024 at 07:14
@skunert added this pull request to the merge queue on Aug 5, 2024
Merged via the queue into paritytech:master with commit 2abd03e Aug 5, 2024
156 of 161 checks passed
@skunert deleted the skunert/beefy-debugging branch on August 5, 2024 at 08:10
ordian added a commit that referenced this pull request Aug 6, 2024
* master: (51 commits)
  Remove unused feature gated code from the minimal template (#5237)
  make polkadot-parachain startup errors pretty (#5214)
  Coretime auto-renew (#4424)
  network/strategy: Backoff and ban overloaded peers to avoid submitting the same request multiple times (#5029)
  Fix frame crate usage doc (#5222)
  beefy: Tolerate pruned state on runtime API call (#5197)
  rpc: Enable ChainSpec for polkadot-parachain (#5205)
  Add an adapter for configuring AssetExchanger (#5130)
  Replace env_logger with sp_tracing (#5065)
  Adjust sync templates flow to use new release branch (#5182)
  litep2p/discovery: Publish authority records with external addresses only (#5176)
  Run UI tests in CI for some other crates (#5167)
  Remove `pallet::getter` usage from the pallet-balances (#4967)
  pallet-timestamp: `UnixTime::now` implementation logs error only if called at genesis (#5055)
  [CI] Cache try-runtime check (#5179)
  [Backport] version bumps and the prdocs reordering from stable2407 (#5178)
  [subsystem-benchmark] Update availability-distribution-regression-bench baseline after recent subsystem changes (#5180)
  Remove pallet::getter usage from proxy (#4963)
  Remove pallet::getter macro usage from pallet-election-provider-multi-phase (#4487)
  Review-bot@2.6.0 (#5177)
  ...
dharjeezy pushed a commit to dharjeezy/polkadot-sdk that referenced this pull request Aug 28, 2024
Labels
T0-node This PR/Issue is related to the topic “node”.
5 participants