beefy: Tolerate pruned state on runtime API call #5197

Merged
merged 7 commits into paritytech:master on Aug 5, 2024

Conversation

@skunert (Contributor) commented on Jul 31, 2024:

While working on #5129 I noticed that after warp sync, nodes would print:

2024-07-29 17:59:23.898 ERROR ⋮beefy: 🥩 Error: ConsensusReset. Restarting voter.    

After some debugging I found that we enter the following loop:

  1. Wait for the beefy pallet to be available: the pallet is detected as available directly after warp sync, since we are at the tip.
  2. Wait for the headers from the tip down to beefy genesis to be available: during this time we don't process finality notifications, since we later want to inspect all of these headers for authority set changes.
  3. Gap sync finishes, so the route to beefy genesis is available.
  4. The worker starts acting and tries to fetch the beefy genesis block. This fails, since we are acting on old finality notifications whose state has already been pruned (see the sketch below).
  5. The whole beefy subsystem is restarted, loading the state from the db again and iterating over a lot of headers.

This was already happening before #5129.
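The change in this PR (shown in the diff further down) makes the worker tolerate that runtime API error instead of restarting. A minimal sketch of the pattern, assuming a simplified error type and passing the fetched result in directly rather than going through the real runtime API client:

```rust
use log::debug;

const LOG_TARGET: &str = "beefy";

/// Simplified stand-in for the worker error that triggers a voter restart.
#[derive(Debug)]
enum WorkerError {
    ConsensusReset,
}

/// Sketch of the tolerant genesis check. `fetched` stands in for the result of
/// the `beefy_genesis` runtime call, which returns `Err(..)` when the state at
/// the queried block has already been pruned.
fn check_beefy_genesis(
    fetched: Result<Option<u64>, String>,
    expected: Option<u64>,
) -> Result<(), WorkerError> {
    match fetched {
        // The on-chain genesis matches what we have persisted: keep going.
        Ok(genesis) if genesis == expected => Ok(()),
        // The on-chain genesis really differs: a consensus reset is required.
        Ok(_) => Err(WorkerError::ConsensusReset),
        // The state was already pruned, e.g. old finality notifications piled
        // up after warp sync: log at debug level and skip this notification
        // instead of restarting the whole voter.
        Err(api_error) => {
            debug!(target: LOG_TARGET, "🥩 Unable to check beefy genesis: {}", api_error);
            Ok(())
        },
    }
}
```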

@skunert added the T0-node (This PR/Issue is related to the topic “node”.) label on Jul 31, 2024
@skunert requested review from serban300, acatangiu and a team on July 31, 2024 at 10:22
@@ -451,7 +451,8 @@ where
state.set_best_grandpa(best_grandpa.clone());
// Overwrite persisted data with newly provided `min_block_delta`.
state.set_min_block_delta(min_block_delta);
- debug!(target: LOG_TARGET, "🥩 Loading BEEFY voter state from db: {:?}.", state);
+ debug!(target: LOG_TARGET, "🥩 Loading BEEFY voter state from db.");
+ trace!(target: LOG_TARGET, "🥩 Loaded state: {:?}.", state);
skunert (Contributor, Author) commented:

This is unrelated, but the state is quite big and causes log files to grow to multiple gigabytes.

@lexnv (Contributor) left a comment:

LGTM!

Now that the beefy worker is more resilient to runtime call errors and does not restart as often, I wonder whether this has also made things better from the memory perspective 🤔

I've seen some periodic spikes (up to 8 MiB; nothing concerning, just something to keep in mind) with the beefy unbounded channels:
(Screenshot from 2024-07-31 14:44 showing the memory spikes of the beefy unbounded channels.)

Err(api_error) => {
// This can happen in case the block was already pruned.
// Mostly after warp sync when finality notifications are piled up.
debug!(target: LOG_TARGET, "🥩 Unable to check beefy genesis: {}", api_error);
@serban300 (Contributor) commented on Jul 31, 2024:

I'm not sure this is safe, since this way we might miss consensus resets, and in that case further worker processing might be incorrect. Thinking about it.

Longer term we plan to add a header log to handle this kind of situation. Something like this: paritytech/substrate#14765

Contributor replied:

Another option would be to store the beefy_genesis in UnpinnedFinalityNotification. We could retrieve it inside the transformer, when it should be available. Would that fix the issue?

skunert (Contributor, Author) replied:


> Another option would be to store the beefy_genesis in UnpinnedFinalityNotification. We could retrieve it inside the transformer, when it should be available. Would that fix the issue?

This could work 👍 .

Regarding missing the ConsensusReset: eventually we will catch up to the latest blocks and find that the beefy_genesis is indeed different from what we have stored. In that case we would still trigger the consensus reset, just later, once we reach the tip. Is that acceptable? What are the risks here? If it is not acceptable, we could add the beefy genesis to the UnpinnedFinalityNotification as you proposed.
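For illustration only, the alternative being discussed might look roughly like the following; the field, its type, and the comments are assumptions, not the actual UnpinnedFinalityNotification definition:

```rust
/// Hypothetical shape of the alternative: carry the beefy genesis inside the
/// notification so the worker never has to query potentially pruned state.
struct UnpinnedFinalityNotification<Hash> {
    /// Hash of the finalized block (placeholder for the existing fields).
    hash: Hash,
    /// Beefy genesis as seen by the runtime at the time of finalization,
    /// queried inside the transformer while the state is still available.
    beefy_genesis: Option<u64>,
}
```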

Contributor replied:


Hmmm, I looked a bit at the code. I was thinking there might be problems with the authority set or the payload, but it looks like that can't happen. I think it's ok, but I would also wait for @acatangiu's input on this.

@acatangiu (Contributor) commented on Aug 1, 2024:


The change itself is correct in what it does, but we might still have undiscovered corner cases that this will surface. I suggest testing it on Rococo, where we have actually reset consensus, and seeing how it behaves.

skunert (Contributor, Author) replied:


Ah okay, I was not aware that this was done on Rococo; I will perform some tests there before merging.

@@ -32,7 +32,7 @@ use crate::{
metrics::register_metrics,
};
use futures::{stream::Fuse, FutureExt, StreamExt};
- use log::{debug, error, info, warn};
+ use log::{debug, error, info, log_enabled, trace, warn, Level};
Contributor commented:


nit:

are log_enabled and Level used?

skunert (Contributor, Author) replied:


Good catch, will remove

@paritytech-cicd-pr commented:

The CI pipeline was cancelled due to the failure of one of the required jobs.
Job name: cargo-clippy
Logs: https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/6880195

@skunert enabled auto-merge on August 5, 2024 at 07:14
@skunert added this pull request to the merge queue on Aug 5, 2024
Merged via the queue into paritytech:master with commit 2abd03e Aug 5, 2024
156 of 161 checks passed
@skunert deleted the skunert/beefy-debugging branch on August 5, 2024 at 08:10
ordian added a commit that referenced this pull request Aug 6, 2024
* master: (51 commits)
  Remove unused feature gated code from the minimal template (#5237)
  make polkadot-parachain startup errors pretty (#5214)
  Coretime auto-renew (#4424)
  network/strategy: Backoff and ban overloaded peers to avoid submitting the same request multiple times (#5029)
  Fix frame crate usage doc (#5222)
  beefy: Tolerate pruned state on runtime API call (#5197)
  rpc: Enable ChainSpec for polkadot-parachain (#5205)
  Add an adapter for configuring AssetExchanger (#5130)
  Replace env_logger with sp_tracing (#5065)
  Adjust sync templates flow to use new release branch (#5182)
  litep2p/discovery: Publish authority records with external addresses only (#5176)
  Run UI tests in CI for some other crates (#5167)
  Remove `pallet::getter` usage from the pallet-balances (#4967)
  pallet-timestamp: `UnixTime::now` implementation logs error only if called at genesis (#5055)
  [CI] Cache try-runtime check (#5179)
  [Backport] version bumps and the prdocs reordering from stable2407 (#5178)
  [subsystem-benchmark] Update availability-distribution-regression-bench baseline after recent subsystem changes (#5180)
  Remove pallet::getter usage from proxy (#4963)
  Remove pallet::getter macro usage from pallet-election-provider-multi-phase (#4487)
  Review-bot@2.6.0 (#5177)
  ...
dharjeezy pushed a commit to dharjeezy/polkadot-sdk that referenced this pull request Aug 28, 2024
Labels
T0-node This PR/Issue is related to the topic “node”.
5 participants