Geth refuses to sync after being a few months offline #25730
It looks like you were not fully snap synced when you shut down your node the last time; you were most likely still in the state heal phase, otherwise geth would have imported blocks one by one afterwards. The problem now is that you have a really old state that is not fully healed, so a lot of accounts, balances etc. differ from the current state. In the state healing phase these states are fetched one by one from the network. Since there are so many differences between you and the network, the state heal phase will likely never finish. The best way to fix this is to stop geth and remove the old state that you have (via `geth removedb`).
Regarding your PS: nodes can recover very quickly from a couple of days of downtime once they are fully synced (and healed). Since the gas limit in Ethereum was voted up and is pretty high at the moment, nodes have to do a lot of computation, especially if they want to catch up to the network.
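A minimal command sketch of that recovery, under stated assumptions: the datadir path and the `systemctl` service name are placeholders for whatever your setup uses, and the interactive prompts of `geth removedb` let you keep the ancient store (as discussed later in this thread).

```shell
# Hypothetical recovery sketch; paths and service names are assumptions.

# 1. Stop the running node first (however you manage it):
sudo systemctl stop geth

# 2. Remove the state database. `geth removedb` prompts per database;
#    keep the ancient store (already-downloaded blocks/receipts) and
#    remove only the state, so old blocks need not be fetched again.
geth --datadir /path/to/datadir removedb

# 3. Restart; geth will snap-sync a fresh, recent state instead of
#    trying to heal the months-old one.
sudo systemctl start geth
```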
Thanks for the reply, but I am 100% sure that it synced: I checked it from the geth console, etc. I guess there's nothing left to do but yet another snap sync; I'll keep this ticket open a few more days just in case. To confirm: once the snap sync is over, geth will behave like a normal "full" node? There's no condition/bug under which it would start a snap sync again? And it's a pity that no one ever explains that "Unexpected bytecode heal packet"!
Yes, a node will never start a sync again after it's been synced once.
I'm running
Should I remove the ancient database as well? I kept the ancient database, and now it seems to work like a full node.
State heal logs are missing at the moment.
I've spent some time looking around and trying to understand the differences between geth's sync modes, but am still left with some open questions. For example: does a geth node that initially snap or fast syncs end up with the same database state as a "full node" that has executed all of the transactions since the genesis block? Or does snap sync involve acquiring only a "digest" of the storage state of all addresses as of the sync point, such that future requests made to this node for storage as of some older block would involve the node making new requests to peers to acquire that older state? Please forgive me if any of this is malformed; I'm asking these questions having made some possibly incorrect assumptions about "everything." Having just written these questions, I thought of a better Google query and found this: #24413. It is probably a good idea to incorporate some or all of the wisdom there into the docs.
After doing `geth removedb` (but keeping the ancient database/freezer) it is now in the state heal process again. It downloaded all missing blocks, and after that the "state heal" logs started to appear. Now, as always, the logs are saturated with:
State healing is painstakingly slow:
Maybe 500,000 accounts per 24h? So: download the recent global state "SC0" using snap, then that global state needs to be "healed" using the downloaded blocks: "SC0" --> blocks --> "SC1" (current global state). Nobody knows why; just (clueless) nerds (like myself) trying to scrape information from the internet. Do you have any testing lab where you put clients online and offline and try to disturb them in some private geth Ethereum network? Or does "CI/CD testing" equal "merge" (I hope not)? EDIT: maybe the ancient database is corrupt? Does the ancient database include a state trie that is loaded in the beginning, which geth then tries to heal?
On a related note, I ended up with the impression that it has become quite difficult to stand up a full node instead of using centralized providers such as Alchemy/Moralis/Infura because of these types of troubles (though perhaps my last attempt was when snap sync was too new). That slowness and trouble pushes dApp providers to rely on centralized resources rather than standing up their own nodes, which may be an issue worth thinking about (there are pros and cons to each). It would be really nice to be able to stand up a full node that can answer historical requests (gcmode archive) quickly, by acquiring that database and/or forwarding such queries to/from peers and then verifying on a best-effort basis over a longer timeframe. I'm not fully conversant with the security implications, implementation difficulties, or bandwidth/resource-management issues of doing such a thing, but I sense it would help reduce centralization onto a handful of dApp service providers.
Similar scenario in my node as well. I have 2 geth archival nodes connected to 2 beacon nodes, and both seem to have completed the merge, but the nodes seem to be syncing very slowly. PFB the status from both nodes. GETH BEACON The slots (3 days behind the current slot) in the beacon node seem to be in sync with the block shown in the geth node. Can anyone confirm whether this is healthy and our sync is well and good, just a bit slow?
This has actually been fixed, but not yet pushed to a stable release. So, I recommend switching geth version.
I recommend
These numbers aren't great -- as you can see, importing one block takes almost
Oh, btw -- is there really a point in keeping this bug open? It seems like the discussion has trailed off into new areas.
Thanks for the update @holiman. Is there any way to improve the import time? I am already using a 16TB SSD and the nodes are in AKS as well. Any solution to increase peers or network bandwidth so that the archival node stays up to date?
Increasing peers will most likely have a negative impact. I don't know what AKS means. If you are interested in trying something new, we have a new database backend which we think will be an improvement: #24615. Unfortunately, we don't have any conversion tool, so the only thing one can do with it is to sync from scratch -- which is a PITA with an archive node, I know, so I don't actually expect you to try it out. Other than that, I don't have a lot of ideas.
@holiman Got it fixed and I started running. But I get this INFO line multiple times for the same block. --GETH--
I have the same issue after upgrading geth's syncmode from full to snap. 12 hours later it is still syncing, with no clear indication of when or if the process will end. There are some posts saying that the "accounts" parameter should reach the total number of Ethereum accounts (201M today) for the snap sync process to end, which in this case means it will never end. Or shall I simply be more patient?
Sep 23 12:34:04 eth2-validator-0 bash[24081]: INFO [09-23|12:34:04.490] State heal in progress accounts=169,292@9.01MiB slots=293,066@22.30MiB codes=401@2.71MiB nodes=2,170,769@585.82MiB pending=20889
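For anyone trying to read these lines, here is a small sketch (plain POSIX shell, nothing geth-specific) that pulls the running counters out of a "State heal in progress" log line; the sample line is copied from this thread.

```shell
# Extract the key=value counters from a geth "State heal in progress"
# log line. Only the numeric count before the "@size" suffix is kept.
line='INFO [09-23|12:34:04.490] State heal in progress accounts=169,292@9.01MiB slots=293,066@22.30MiB codes=401@2.71MiB nodes=2,170,769@585.82MiB pending=20889'

for key in accounts slots codes nodes pending; do
  # grab "key=<digits and commas>", drop the key name and the commas
  val=$(printf '%s\n' "$line" | grep -o "${key}=[0-9,]*" | head -n1 | cut -d= -f2 | tr -d ,)
  echo "${key}=${val}"
done
```

Running it prints `accounts=169292`, `slots=293066`, `codes=401`, `nodes=2170769`, `pending=20889`, which makes it easier to plot or compare these counters over time.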
In an attempt to follow @holiman's recommendation, I failed to locate the latest version of geth (a.k.a. 1.11.xx-unstable). The last version published on GitHub is 1.10.25. A Google search for "go-ethereum v1.11.1" is not helpful either. Any help is welcome.
All it took was enough patience. Snap sync finalized after 20 hours on a GCloud 16 vCPU machine (e2-standard-16) with Balanced Persistent Disks. It will be downgraded to e2-standard-4 now that it is fully synced. Contrary to what is reported in other posts, the "accounts" parameter in the log line below does not need to reach the total number of Ethereum accounts for the snap sync process to finalize.
Sep 23 12:34:04 eth2-validator-0 bash[24081]: INFO [09-23|12:34:04.490] State heal in progress accounts=169,292@9.01MiB slots=293,066@22.30MiB codes=401@2.71MiB nodes=2,170,769@585.82MiB pending=20889
@Yvelo The issue is that mine is an archival sync node :D So I think I need a lot more patience. Happily, it has reached 6 hours behind.
Didn't see this until now. On any of our "Releases" pages, e.g. https://github.com/ethereum/go-ethereum/releases/tag/v1.10.25, we link to the downloads page.
On that downloads page, we have 'stable' and 'develop' builds.
I am doing a sync from scratch with geth version 1.10.25-stable, with this command line: The datadir is now about 605G, the "State heal in progress" phase started about 2 days ago, and the logs now show:
So it is progressing at about 1M accounts per day? At this rate, I need to wait 200 days for it to be fully synced to 200M accounts? @Yvelo, can you share the number of accounts shown in your logs when it was fully synced? Is it around 169K accounts? What indicator in the log can be used to estimate the time for the state heal to finish? If I run this command:
and type:
there is no way to tell what the "highest" number of trie nodes is, is there?
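The back-of-the-envelope arithmetic in the question above is just a division; writing it out with the poster's own rough numbers (and with the caveat the maintainers give repeatedly in this thread: healed-so-far counts cannot be extrapolated into an ETA):

```shell
# Naive "days remaining" extrapolation. NOTE: per the geth maintainers,
# this is NOT a valid ETA; the snippet only reproduces the arithmetic
# from the question. Both inputs are rough, observed numbers.
healed_per_day=1000000      # ~1M accounts healed per day (observed)
total_accounts=200000000    # ~200M accounts on mainnet (approximate)
days=$(( total_accounts / healed_per_day ))
echo "naive estimate: ${days} days"   # prints "naive estimate: 200 days"
```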
No, that's not how it works. But I'm curious, why are you using 1.10.25?
There's no point trying to infer the state of the sync based on how many nodes or accounts you have synced. I know it's tempting (users compare those numbers all the time), but it's not a good metric. It's like two people walking up an escalator that is going down: comparing how many steps they take to reach the top tells you little, because they walk at different speeds, and one started near the bottom while the other started close to the top.
Thanks for the prompt reply, @holiman! I used 1.10.25 because I started using geth around 5 days ago without knowing about this bug; I only read about it now. Hearing that I should delete everything and start from scratch again feels discouraging. If I were to use 1.11.xx, what is the expected time to sync from scratch to full? Should I use the same command line as above, or is there a faster way to do it, perhaps something like a checkpoint sync? I am wondering: is there really no good metric at all for the ETA of the state heal? I don't need to compare with other people if the log shows any indicator of the ETA...
Just switch then; you don't have to restart from scratch.
A couple of days. Yes. No. No.
No. If there were, you can rest assured we would have added it and displayed it prominently to the users. It's something that users have requested consistently since 2016.
I cannot install 1.11.xx from brew on macOS using this doc: https://geth.ethereum.org/docs/install-and-build/installing-geth. Running:
I got:
If I go to the downloads page (https://geth.ethereum.org/downloads/), I can't seem to find 1.11 in the "Develop builds" section; it looks empty. When will 1.11 be released as stable?
Hm, yes, that's some bug in our pipeline. We'll look into it. |
Fixed, macOS binaries are available again.
It is quite a frustrating experience using geth to sync from scratch without any indication of when it will complete. The logs do print the number of accounts (mine shows about 1M), but AFAIK there are 200M accounts, and I've been syncing for 4+ days nonstop; at this rate, it would take more than a year to fully sync. I am not comfortable using the unstable 1.11 version; I don't know what other issues it might have, and I don't want to risk it for now. So I gave besu a try, syncing from scratch using a checkpoint, and it finished in a bit less than 2 days using this command:
The logs are also very clear; they show:
The current number of nodes is 15790393, so I know it is nearly finished. Compare that with the log from geth:
The number of nodes is
I hope my feedback can make geth's logs better.
As I have already said: there's no point trying to infer the state of the sync based on how many nodes or accounts you have synced. I know it's tempting (users compare those numbers all the time), but it's not a good metric. You cannot extrapolate the remaining state-heal time from those numbers.
Good to hear!
No, as I've said, that's not possible. If it could be done, we would be doing it already and would show you the estimate.
I have a similar problem. When do you plan to add this fix to the ethereum/client-go stable Docker tag?
Weird behaviour from geth when bringing a client online after it's been offline for a few months. I would really appreciate it if someone could explain to me what's going on / put me on the right track.
What has happened so far:
I am using the "--syncmode=snap" option. When "--syncmode=snap" is enabled, I have to admit I am not sure which of the two ways, (A) or (B), geth is syncing:
A. The full-node way:
B. Snapshot download:
Is there some criterion for which of the two ways geth ends up using? Say, if the state is old enough, do (B) instead of (A)?
Obviously, I would prefer (A), since I'm not syncing from scratch. Should I then remove "--syncmode=snap"?
Next, let's take a look at the logs:
Let's continue:
The client was running ~4 days until I got fed up and took it offline. It got this far:
I interpret this as: the downloaded and verified blocks are OK (it seems to reach the latest block occasionally), but the client is unable to reconstruct the up-to-date world state "SC".
I don't understand why. Once a block is downloaded and verified you get the state changes, so what is interfering with the state update/reconstruction?
Is there some competition/contradiction going on between (A) and (B)? (see above)
Some details on my system:
8 x Intel(R) Core(TM) i7-4700HQ CPU @ 2.40GHz
2TB SSD drive: "Crucial BX500"
12 GB RAM (& all the swap you could wish for)
(as mentioned, this was able to snapsync from scratch before summer)
P.S. It would be important to be able to contribute to the network without being a 24/7 available/on-call eth-client-massage service provider; to be able to go offline if needed, etc. Imagine if I had been staking at this point!