
warp_sync hanging on AwaitingPeers if warp_sync_params is None #14

Closed
brunopgalvao opened this issue Apr 14, 2023 · 8 comments
Labels
I2-bug The node fails to follow expected behavior. T0-node This PR/Issue is related to the topic “node”.

Comments

@brunopgalvao
Contributor

Is there an existing issue?

  • I have searched the existing issues

Experiencing problems? Have you tried our Stack Exchange first?

  • This is not a support question.

Description of bug

If warp_sync_params is set to None:

warp_sync_params: None

warp_sync will hang on AwaitingPeers:
https://github.com/paritytech/substrate/blob/53c58fbc3dc6060a2f611a8d7adfdeea655fe324/client/network/sync/src/lib.rs#L516-L517

Steps to reproduce

Set warp_sync_params here to None.

@arkpar
Member

arkpar commented Apr 14, 2023

After paritytech/substrate#12761, if a parachain does not configure warp_sync_params but the user still passes SyncMode::Warp, sync ends up stuck in the AwaitingPeers state forever, which is indeed incorrect. I can propose two possible solutions here (sketched below):

  1. Move warp_sync_params into the SyncMode::Warp enum variant, so that it is always set when warp sync is enabled.
  2. Fall back to SyncMode::Full when warp_sync_params is missing, and possibly print a misconfiguration warning.
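
A minimal sketch of the two options, using stand-in types rather than the real sc_network / sc_service definitions (the actual SyncMode and WarpSyncParams types carry more data than shown here):

struct WarpSyncParams; // stand-in for sc_network::WarpSyncParams<Block>

// Option 1: carry the parameters inside the variant, so warp sync cannot be
// enabled without them being set.
enum SyncMode {
    Full,
    Warp(WarpSyncParams),
}

// Option 2: keep a flat mode flag plus an Option, but fall back to full sync
// (with a warning) whenever the parameters are missing.
fn resolve_sync_mode(warp_requested: bool, params: Option<WarpSyncParams>) -> SyncMode {
    match (warp_requested, params) {
        (true, Some(params)) => SyncMode::Warp(params),
        (true, None) => {
            eprintln!("Warp sync requested but warp_sync_params is not configured; falling back to full sync.");
            SyncMode::Full
        }
        (false, _) => SyncMode::Full,
    }
}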

@arkpar arkpar added I3-bug T0-node This PR/Issue is related to the topic “node”. and removed J2-unconfirmed labels Apr 14, 2023
@shunsukew

shunsukew commented Apr 14, 2023

In my testing, actually, warp_sync_params is set to Some(WarpSyncParams::WaitForTarget(target_block)) https://github.com/AstarNetwork/Astar/pull/901/files & https://github.com/paritytech/cumulus/blob/d6eef144421ef5c3f339f681484d06bb729dfa82/client/service/src/lib.rs#L303.

It downloads state for a certain period of time, but then halts at AwaitingPeers. Restarting the process resumes the state download.

@bkchr
Member

bkchr commented Apr 15, 2023

@shunsukew how many peers are you connected to?


@bkchr
Member

bkchr commented Apr 22, 2023

@shunsukew your problem is something different! If you run the warp sync, you will see that it prints an error once it finishes downloading the state; after that, the warp sync is stuck, as you observed.

The first thing you need to do is add this "fix":

diff --git a/bin/collator/src/parachain/shell_upgrade.rs b/bin/collator/src/parachain/shell_upgrade.rs
index acf7a3cc..b0858139 100644
--- a/bin/collator/src/parachain/shell_upgrade.rs
+++ b/bin/collator/src/parachain/shell_upgrade.rs
@@ -21,7 +21,7 @@
 use cumulus_client_consensus_common::{ParachainCandidate, ParachainConsensus};
 use cumulus_primitives_core::relay_chain::{Hash as PHash, PersistedValidationData};
 use futures::lock::Mutex;
-use sc_consensus::{import_queue::Verifier as VerifierT, BlockImportParams};
+use sc_consensus::{import_queue::Verifier as VerifierT, BlockImportParams, ForkChoiceStrategy};
 use sp_api::ApiExt;
 use sp_consensus::CacheKeyId;
 use sp_consensus_aura::{sr25519::AuthorityId as AuraId, AuraApi};
@@ -113,7 +113,7 @@ where
 {
     async fn verify(
         &mut self,
-        block_import: BlockImportParams<Block, ()>,
+        mut block_import: BlockImportParams<Block, ()>,
     ) -> Result<
         (
             BlockImportParams<Block, ()>,
@@ -121,6 +121,18 @@ where
         ),
         String,
     > {
+        // Skip checks that include execution, if being told so or when importing only state.
+        //
+        // This is done for example when gap syncing and it is expected that the block after the gap
+        // was checked/chosen properly, e.g. by warp syncing to this block using a finality proof.
+        // Or when we are importing state only and can not verify the seal.
+        if block_import.with_state() || block_import.state_action.skip_execution_checks() {
+            // When we are importing only the state of a block, it will be the best block.
+            block_import.fork_choice = Some(ForkChoiceStrategy::Custom(block_import.with_state()));
+
+            return Ok((block_import, None));
+        }
+
         let block_hash = *block_import.header.parent_hash();
 
         if self

This will skip the verification when we import state changes directly. However, you will still see an issue about the storage root being "invalid". The problem here is the following PR: AstarNetwork/Astar#567

That PR set state_version: 1 without any migration being run. This means the state you download is partly state_version = 0 and partly state_version = 1, which then leads to a state root that differs from the one stored in the header. You will need to add the state migration pallet to your runtime and let the entire state migrate to state_version = 1; then warp sync should work with the patch I posted above.
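
For reference, the flag in question is the state_version field of the runtime's RuntimeVersion; a hedged, illustrative sketch with placeholder values (not Astar's actual version constants):

use sp_runtime::create_runtime_str;
use sp_version::RuntimeVersion;

// Placeholder values only; the relevant part is the state_version field at the end.
pub const VERSION: RuntimeVersion = RuntimeVersion {
    spec_name: create_runtime_str!("example"),
    impl_name: create_runtime_str!("example"),
    authoring_version: 1,
    spec_version: 1,
    impl_version: 1,
    apis: sp_version::create_apis_vec!([]),
    transaction_version: 1,
    // Flipping this from 0 to 1 only affects keys written afterwards; existing keys keep
    // the old trie layout until they are rewritten, e.g. by the state migration pallet.
    state_version: 1,
};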

@cheme should be able to provide you with some guidance on how to set up the state migration pallet for parachains.

CC @akru

@shunsukew

shunsukew commented Apr 24, 2023

Thank you so much for checking and letting us know the required changes.
We will conduct the state_version migration. It would probably be better to continue my topic in Astar's repo, since our problem differs from this GitHub issue; please let me mention you there.
AstarNetwork/Astar#901 (comment)

As for the Verifier fix: this Verifier is a wrapper around AuraVerifier and RelayChainVerifier, and inside AuraVerifier's verify function the check you suggested is already performed:

if block.with_state() || block.state_action.skip_execution_checks() {
    // When we are importing only the state of a block, it will be the best block.
    block.fork_choice = Some(ForkChoiceStrategy::Custom(block.with_state()));

    return Ok((block, Default::default()))
}

So, considering that Astar has had the Aura runtime API from the beginning, I think this is not directly related to the problem we have. (The patch is necessary for Shiden, but I am only testing Astar warp sync.)

@cheme
Contributor

cheme commented Apr 24, 2023

@shunsukew, here is a small write-up, since the available info may be a bit sparse.

If the chain was not using state version 1 at first and switched to it afterward, then it is very likely that the state is in a "hybrid" mode, with some trie nodes stored under state version 0 and some under state version 1.

To check that, the first step would be to run this RPC: state.trieMigrationStatus

The RPC simply iterates over all of the chain's trie nodes and looks for state version 0 nodes that should be migrated to state version 1 (paritytech/cumulus#1424).

If the result shows items to migrate, then those nodes indeed need to be migrated before warp sync can be used.

Changing a node from state v0 to v1 just requires rewriting the value.

That is what the migration pallet does. See: https://hackmd.io/JagpUd8tTjuKf9HQtpvHIQ
Note that the migration rewrites all state values (we cannot tell from the runtime whether a node is v0 or v1).
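
A minimal sketch of why a plain rewrite is enough, assuming a runtime context where the sp_io storage host functions are available: writes go through the trie layout selected by the runtime's current state_version, so reading a value and writing it back unchanged upgrades the underlying trie node from v0 to v1.

// Sketch only; the real migration pallet batches such rewrites under size/weight limits.
fn rewrite_value_in_place(key: &[u8]) {
    if let Some(value) = sp_io::storage::get(key) {
        // Writing the same bytes back re-encodes the trie node under the current
        // state_version, which is all that "migrating" a value means here.
        sp_io::storage::set(key, &value);
    }
}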

There are two scenarios:

  • not a parachain: a size-limited migration runs in every block, and a cursor in state is updated each time.
  • a parachain: extrinsics emitted from a specific account are allowed to migrate values (and move the cursor).

It should be noted that the parachain process requires specific rights for a parachain account, plus an external process that emits the extrinsics.
This is needed to ensure the migration does not get stuck on an oversized block (for instance, if the cursor covers 10 values, the first nine bring the block near the size limit and the 10th pushes it over, then that 10th value would always end up in the PoV and every PoV would be oversized; with the manual approach, the block containing the extrinsic simply cannot be produced, and an extrinsic covering fewer values can be emitted instead).

Personally, if using the manual migration is a no-go (for instance, if granting a given account the right to emit these extrinsics is a no-go), I would run the automatic migration with a modification (a rough sketch follows the list):

  • modify the RPC to find the keys of values with a problematic size
  • add a btree of those keys to the pallet
  • the pallet first processes those values, one per block
  • the pallet skips these values when running the automatic migration afterward

Obviously this only works if there are not too many of these big values, and if the chain never creates new keys with such big values (they could not be skipped if they are unknown when the pallet is included).
These changes to the pallet are not too difficult, I think.
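
A rough, hypothetical sketch of that idea (names are made up for illustration and this is not the actual pallet-state-trie-migration code): keep the problematic keys in a set, migrate them one per block first, and have the automatic migration skip them afterwards.

use std::collections::BTreeSet;

// In a real pallet these would be storage items rather than an in-memory struct.
struct BigValueMigration {
    // Keys whose values are close to the PoV size limit, found via the modified RPC.
    oversized_keys: BTreeSet<Vec<u8>>,
}

impl BigValueMigration {
    // Called once per block before the automatic migration: handle at most one big key,
    // so a single block never carries more than one oversized value in its PoV.
    fn step(&mut self, mut rewrite_value: impl FnMut(&[u8])) -> bool {
        // Take (a copy of) the smallest remaining oversized key, if any.
        let next_key = self.oversized_keys.iter().next().cloned();
        if let Some(key) = next_key {
            rewrite_value(&key);
            self.oversized_keys.remove(&key);
            return true; // one big value migrated in this block
        }
        false // nothing left; the normal cursor-based migration can run
    }

    // The automatic migration consults this to skip the keys handled above.
    fn should_skip(&self, key: &[u8]) -> bool {
        self.oversized_keys.contains(key)
    }
}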

Regarding testing the migration, I would suggest doing it with try_runtime.

@shunsukew

Thank you so much for the detailed info.

@the-right-joyce the-right-joyce transferred this issue from paritytech/substrate Aug 24, 2023
@the-right-joyce the-right-joyce added I2-bug The node fails to follow expected behavior. and removed I3-bug labels Aug 25, 2023
@bkchr bkchr closed this as completed Aug 12, 2024