[Bug] Client Node Block Syncing Fails Due to Current/Next Round Comparison #4

Open
fulltimemike opened this issue Apr 11, 2024 · 2 comments
Labels
bug Something isn't working

Comments

fulltimemike commented Apr 11, 2024

🐛 Bug Report

Sometimes, after a Client Node is restarted, the following error message appears: "The next block (X) is invalid - Failed to speculate on transactions - Failed to post-ratify - Next round Y must be greater than current round Y". This error causes the client to stop syncing, and further restarts do not fix it. To resume syncing, the client ledger must be modified: either the ledger is reset so the client resyncs from genesis, or a snapshot is loaded into the client.
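To make the failure mode concrete: as written, the message has the same value Y on both sides, which suggests a strict "greater than" requirement on rounds. Below is a minimal sketch of that reading in Python. `check_next_round` is a hypothetical name used only for illustration; it is not the actual snarkOS or snarkVM function, and the real check may differ.

```python
def check_next_round(current_round: int, next_round: int) -> None:
    """Hypothetical stand-in for the round check implied by the error message."""
    # A strict comparison rejects a block whose round equals the current round,
    # which matches the "Next round Y ... current round Y" wording in the error.
    if next_round <= current_round:
        raise ValueError(
            f"Next round {next_round} must be greater than current round {current_round}"
        )
```

Under this assumption, a ledger whose current round has already reached the round of the block being re-applied would fail on every attempt, which would be consistent with restarts not helping.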

I'm uncertain whether this bug is directly in snarkOS, or if it is a problem with snarkVM. The specific error is thrown here.

Logs directly before the bug is thrown.

In this example, interestingly, the blocks and rounds being logged and added to the ledger (block 185,032, round 412,383) are much further ahead than the block and round identified in the error (block 111,196, round 252,154). I'm not sure why the store is apparently adding earlier rounds and blocks when it has already passed this point.

Steps to Reproduce

Across multiple canary net client nodes, we have observed that restarting the node can cause syncing to fail. The bug is nondeterministic, but restarting a client node enough times eventually triggers the error. The client may need to be actively syncing during the restarts for the bug to occur, but I can't be certain.
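Because the bug only shows up after some number of restarts, it can help to scan node logs for the error automatically. The following is a rough Python sketch that assumes the log lines contain the error text exactly as quoted in this report; `find_round_errors` is a hypothetical helper, not part of snarkOS.

```python
import re

# Pattern taken from the error text quoted in this report.
ROUND_ERROR = re.compile(
    r"Next round (\d+) must be greater than current round (\d+)"
)


def find_round_errors(log_lines):
    """Return (next_round, current_round) pairs for each matching log line."""
    hits = []
    for line in log_lines:
        match = ROUND_ERROR.search(line)
        if match:
            hits.append((int(match.group(1)), int(match.group(2))))
    return hits
```

A restart loop could call this on the log output after each restart and stop as soon as it returns a non-empty list.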

Expected Behavior

Restarting a client node should not cause the client to get stuck permanently when syncing.

Your Environment

An EC2 Linux machine running a fork of snarkOS with commits up to AleoNet@6aba25d.

fulltimemike added the bug label Apr 11, 2024
@Meshiest

[Image: chart. Flat lines in this chart indicate the issue occurring.]

network topology:

  • 10 validator devnet on AWS c6a.8xlarges
  • 0 clients
  • no dedicated tx cannon

Reproduce with some automation that resets the ledger of the same 2 validators every 30 minutes.

As early as the first 500 blocks, we frequently run into this issue on one or both of the 2 reset validators after they reach the tip.

logs in gdrive

notes:

  • we are running a wrapper around snarkos to make checkpoints but the core snarkos code is only modified with the canary patch
  • rebooting from this state usually results in a "missing block hash" corrupted ledger error
  • we were able to reproduce this by locally running 10 validators on the same machine

@raychu86

We have a tentative fix for this issue for validators - https://github.com/AleoHQ/snarkOS/pull/3232. The fix is currently undergoing burn-in testing and internal verification.
