Consensus fails when using statesync mode to synchronize the application state #9324
Comments
@alessio @alexanderbez @tessr I posted it here, so we can discuss it more.
Thank you @chengwenxi! I appreciate the detailed writeup. We definitely need to investigate this. However, my initial suspicion is: why have we not experienced or seen this reported on the mainnet hub? I know for sure there are state-sync nodes that take snapshots post IBC-enablement. @marbar3778 can you confirm? I suppose we could easily spin up a node to verify this. Do we know of any state sync providers?
I was able to state sync before IBC was enabled, but this smells like an issue where state sync only syncs state in IAVL; I'm guessing capability is not in there.
Yes, caps are in-memory only. They're not part of state-sync on purpose.
Yes, in this case the problem is not reproduced on the mainnet hub because the caps are added on
/cc @colin-axner @AdityaSripal @cwgoes for visibility.
This is the issue, right? Does NewApp get called for state sync? Or, I guess, any usage of capabilities during state sync is a problem? In IBC, capabilities are created at various times: sometimes during InitChain for binding ports, always during a channel handshake, and randomly by applications as they decide to bind to new port names.
Yes, NewApp is called. But when using state sync, the App generated from genesis does not contain the capabilities. I think app.CapabilityKeeper.InitializeAndSeal(ctx) should be called again to initialize the capabilities after the state synchronization is completed.
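For illustration only, a minimal sketch of that suggestion, assuming an app wrapper type (App) with a CapabilityKeeper field as in simapp/gaia; the onStateSyncRestored hook and its placement are assumptions, not an existing SDK callback or the fix that eventually landed:

```go
package app

import (
	sdk "github.com/cosmos/cosmos-sdk/types"
)

// Hypothetical sketch only: re-run the capability keeper's initialization
// once the state-sync snapshot has been restored, so the in-memory store is
// rebuilt from the persisted capability index and GetCapability resolves the
// same capabilities as nodes that replayed from genesis.
// onStateSyncRestored is an assumed hook, not an existing SDK callback.
func (app *App) onStateSyncRestored(ctx sdk.Context) {
	// InitializeAndSeal (the v0.42-era API referenced above) fills the
	// transient memStore and capability map from the persistent store.
	app.CapabilityKeeper.InitializeAndSeal(ctx)
}
```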
Just experienced this on Cosmos Hub. I'm trying to state sync a new node. I tried a few times and it panics at the same block height.
Thanks for confirming @kwunyeung! How do we go about fixing this @colin-axner @AdityaSripal? Seems like a pretty serious bug.
@kwunyeung Hi, kwunyeung! Could you point me to some snapshot nodes? I want to reproduce the issue #9234.
@chengwenxi you can connect to the following nodes. They both have snapshots.
And if you need RPCs
Extremely grateful ❤️!
I think we need to have something that iterates over every open port and creates a capability after state is synced. It should be a non-consensus-breaking fix.
We will discuss this later today
This sounds like the right approach to me. I'm not sure how claiming capabilities will work. It'd also be nice to have a general solution. This does not only affect IBC; it affects any module that uses x/capability.
The issue is that the in-memory component is never repopulated after state sync. The solution I have in mind is to fill in the in-memory component as-needed whenever a capability is looked up.
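A rough sketch of that lazy-fill idea, assuming (as proposed below) that the reverse name-to-index mapping also lives in the persistent store; the method name and fallback logic are illustrative, not the SDK's actual GetCapability or the patch that was merged:

```go
package keeper

import (
	sdk "github.com/cosmos/cosmos-sdk/types"
	"github.com/cosmos/cosmos-sdk/x/capability/types"
)

// getCapabilityLazy is an illustrative sketch: if the in-memory mapping is
// empty (e.g. right after a state-sync restore), fall back to a persisted
// reverse mapping and repopulate the memStore before answering.
// It glosses over the capMap, which would also need to be rebuilt so that
// capability pointers keep their identity across modules.
func (sk ScopedKeeper) getCapabilityLazy(ctx sdk.Context, name string) (*types.Capability, bool) {
	memStore := ctx.KVStore(sk.memKey)
	key := types.RevCapabilityKey(sk.module, name)

	indexBytes := memStore.Get(key)
	if len(indexBytes) == 0 {
		// In-memory component missing: read the reverse mapping from the
		// persistent store (the breaking change discussed below) and fill
		// the memStore on the fly.
		indexBytes = ctx.KVStore(sk.storeKey).Get(key)
		if len(indexBytes) == 0 {
			return nil, false
		}
		memStore.Set(key, indexBytes)
	}

	return types.NewCapability(sdk.BigEndianToUint64(indexBytes)), true
}
```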
I have a non-breaking fix up that will be able to fix the issue for the 0.42 line here: https://github.com/cosmos/cosmos-sdk/tree/aditya/cap-init
Here's the diff: https://github.com/cosmos/cosmos-sdk/compare/v0.42.5...aditya/cap-init?expand=1
Unfortunately, the fix I proposed above can only be done efficiently if we move the reverse mapping into the persistent store. The reverse mapping is deterministic, so there's no issue moving it; it's just a breaking change. Once that is done, reconstructing the forward mapping and capmap on-the-fly is trivial. This fix should go into 0.43.
I will write tests for this tomorrow, but in the meantime it would be great if someone is able to test it out and see whether statesync works.
@chengwenxi @kwunyeung can you apply @AdityaSripal's patch to the sources, rebuild and see whether this fixes the issue for Cosmos SDK v0.42 and gaia, please? |
It seems that
@alessio applied the patch to
I've encountered this issue before, and I think it tends to happen if the pruning settings and the snapshot settings are not aligned correctly, but it's a total pain to debug.
Note the previous commit had an issue. Please try against the latest PR above.
@zmanian those nodes are using … or … now
What is the trust period on the syncing nodes set to? |
@zmanian 336h
Could you post an excerpt of your config.toml statesync section? Either you don't have enough RPC connections, or it's the light client issue I also ran into (tendermint/tendermint#6389). Could you also change your log level to debug to see what is going on?
@marbar3778 I had started a new node in the same private network as those two nodes serving snapshots and enabled debug last night. You can download a short extract of the debug log below. https://forbole-val.s3.ap-east-1.amazonaws.com/gaiad-state-sync-log.txt.gz I saw the following log:
And then the node did request snapshot chunks, but I never saw it apply the chunks.
After getting all snapshot chunks, it said
The most interesting part is that the node which panicked earlier is actually in the same private network, and it started downloading the snapshot and applying the chunks immediately. The state was restored in minutes and it started committing blocks, but it panicked at a certain height. Actually, that node is running now, as I unsafe-reset and restarted it right after that panic. The state was synced again. I stopped that node right after the state was restored, disabled state sync and started it again. The node has been running smoothly since then.
When I tried to test the fix, I was running on another machine. I couldn't restore the state, so as this new machine with the
I left that new node running the whole day as I had to finish some other work. I have just had time to check it again. The state was not restored yet and the node kept only trying to connect to new nodes. I stopped it, unsafe-reset it and started it again. Voilà! It downloaded the snapshot chunks and applied them immediately! I'm going to run it overnight again and see if it panics at some point. As I restarted the node without changing any settings, the only interesting thing to notice would be the
@AdityaSripal the node has been running without issue for 3 days already. No crash, no restart.
Summary of Bug
Consensus fails when using statesync mode to synchronize the application state and then executing an ibc-transfer transaction.
Description
When a cosmos-sdk-based chain is started, the capability/keeper/keeper.go#L177:InitializeCapability(...) method is called to initialize the memStore from the application store. However, if the node is started using statesync mode, the application store is not loaded until the node is switched to fastsync mode, and in that case the InitializeCapability method is not called again to initialize the memStore. Therefore, when the method capability/keeper/keeper.go#L344:GetCapability(...) is called, a node started using statesync mode cannot get the same result as other nodes.
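In simplified form (a toy illustration of the mechanism described above, using plain maps instead of the actual stores; names and values here are made up), the divergence looks like this:

```go
package main

import "fmt"

// initializeCapabilities mimics InitializeCapability: at normal startup the
// persisted capability names/indexes are copied into the in-memory store.
func initializeCapabilities(persistent, memStore map[string]uint64) {
	for name, index := range persistent {
		memStore[name] = index
	}
}

// getCapability mimics GetCapability: only the in-memory copy is consulted.
func getCapability(memStore map[string]uint64, name string) (uint64, bool) {
	index, ok := memStore[name]
	return index, ok
}

func main() {
	persistent := map[string]uint64{"ports/transfer": 1}

	// Node restarted normally: memStore is initialized from persistent state.
	normalMem := map[string]uint64{}
	initializeCapabilities(persistent, normalMem)

	// State-synced node: persistent state was restored from a snapshot, but
	// the initialization step never ran, so its memStore stays empty.
	syncedMem := map[string]uint64{}

	_, okNormal := getCapability(normalMem, "ports/transfer")
	_, okSynced := getCapability(syncedMem, "ports/transfer")
	fmt.Println(okNormal, okSynced) // true false -> the nodes disagree and consensus fails
}
```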
Steps to Reproduce
The GetCapability(...) method is used in the IBC module, so the issue can be reproduced through ibc-transfer:
1. Start two testnets via gaia and create a relayer for them, then create clients and channels. Refer to: https://github.com/cosmos/relayer#demo
2. Create node1 and node2 to join testnet ibc-0, then update the state-sync config in node1/config/app.toml and node2/config/app.toml:
Start node1 and node2:
3. Create node3 to join testnet ibc-0. Update config:
4. Send an ibc-transfer and relay the packets:
rly tx transfer ibc-0 ibc-1 1000000samoleans $(rly chains address ibc-1)
rly tx relay-packets demo -d
5. Start node3 using statesync mode:
# NOTE: modify ports and add ibc-0 peer
gaiad start --home node3
Get a consensus failure error when the ibc-transfer transaction is executed:
NOTE: if the latest block height is greater than the height at which the ibc-transfer transaction was executed, no error is returned; you can unsafe-reset-all node3 and repeat steps 4-5.