Failing to store most recent snapshot #22463

Closed
MysticRyuujin opened this issue Mar 8, 2021 · 16 comments

@MysticRyuujin
Contributor

I'm running an archive node:

/usr/bin/geth --datadir=/data/geth/datadir --syncmode=full --gcmode=archive  --cache=8192 --ipcdisable --maxpeers=100 --light.serve 400

I removed some irrelevant flags (e.g. http, ws, graphql, ethstats).

I asked over on the Discord whether I needed to be concerned about any of the new flags, and I was told no, not really, because I run with
--syncmode=full --gcmode=archive

This node is not yet fully in sync; it's back in the 10M block range.

The first time I started it up, I got a lot of logs like this:

INFO [03-08|09:27:33.196] Imported new chain segment               blocks=12  txs=1295 mgas=130.991 elapsed=8.039s     mgasps=16.294 number=10492292 hash="25dcf5…157edf" age=7mo3w18h dirty=0.00B
INFO [03-08|09:27:33.866] Wiper running, state snapshotting paused accounts=0 slots=0 storage=0.00B elapsed=19m3.465s
INFO [03-08|09:27:34.423] Wiper running, state snapshotting paused accounts=0 slots=0 storage=0.00B elapsed=19m4.023s
INFO [03-08|09:27:35.136] Wiper running, state snapshotting paused accounts=0 slots=0 storage=0.00B elapsed=19m4.735s
INFO [03-08|09:27:35.803] Wiper running, state snapshotting paused accounts=0 slots=0 storage=0.00B elapsed=19m5.402s
INFO [03-08|09:27:36.192] Wiper running, state snapshotting paused accounts=0 slots=0 storage=0.00B elapsed=19m5.791s
INFO [03-08|09:27:37.075] Wiper running, state snapshotting paused accounts=0 slots=0 storage=0.00B elapsed=19m6.674s
INFO [03-08|09:27:37.829] Wiper running, state snapshotting paused accounts=0 slots=0 storage=0.00B elapsed=19m7.429s
INFO [03-08|09:27:38.612] Wiper running, state snapshotting paused accounts=0 slots=0 storage=0.00B elapsed=19m8.212s
INFO [03-08|09:27:39.205] Wiper running, state snapshotting paused accounts=0 slots=0 storage=0.00B elapsed=19m8.804s
INFO [03-08|09:27:40.461] Wiper running, state snapshotting paused accounts=0 slots=0 storage=0.00B elapsed=19m10.061s

I then got this log entry:

INFO [03-08|09:37:34.174] Compacted snapshot area in database      elapsed=25m28.016s

Then, my node started spitting this out A LOT:

INFO [03-08|09:43:00.086] Imported new chain segment               blocks=16  txs=3430 mgas=190.630 elapsed=8.341s     mgasps=22.853 number=10493904 hash="bab0c5…d81b9b" age=7mo3w12h dirty=0.00B
INFO [03-08|09:43:00.551] Aborting state snapshot generation       root="5c7442…51d5e4" in="002afd…0ebda1" at="db3f6a…66e17a" accounts=61641 slots=63464 storage=7.50MiB    elapsed=5m26.376s  eta=138h6m44.263s
INFO [03-08|09:43:00.551] Resuming state snapshot generation       root="64dcef…c2148f" in="002afd…0ebda1" at="db3f6a…66e17a" accounts=61641 slots=63464 storage=7.50MiB    elapsed=5m26.377s  eta=138h6m45.787s
INFO [03-08|09:43:01.021] Aborting state snapshot generation       root="64dcef…c2148f" in="002afd…0ebda1" at="dfa37a…796135" accounts=61641 slots=63631 storage=7.51MiB    elapsed=5m26.847s  eta=138h18m41.787s
INFO [03-08|09:43:01.022] Resuming state snapshot generation       root="20aec8…41eb85" in="002afd…0ebda1" at="dfa37a…796135" accounts=61641 slots=63631 storage=7.51MiB    elapsed=5m26.847s  eta=138h18m41.787s
INFO [03-08|09:43:01.056] Aborting state snapshot generation       root="20aec8…41eb85" in="002afd…0ebda1" at="dff358…62eed7" accounts=61641 slots=63645 storage=7.51MiB    elapsed=5m26.881s  eta=138h19m33.583s
INFO [03-08|09:43:01.057] Resuming state snapshot generation       root="8ace18…c5dc63" in="002afd…0ebda1" at="dff358…62eed7" accounts=61641 slots=63645 storage=7.51MiB    elapsed=5m26.882s  eta=138h19m35.106s
INFO [03-08|09:43:01.771] Aborting state snapshot generation       root="8ace18…c5dc63" in="002afd…0ebda1" at="e6283f…9a2eb5" accounts=61641 slots=63897 storage=7.53MiB    elapsed=5m27.596s  eta=138h37m42.817s
INFO [03-08|09:43:01.771] Resuming state snapshot generation       root="2aea57…3775c0" in="002afd…0ebda1" at="e6283f…9a2eb5" accounts=61641 slots=63897 storage=7.53MiB    elapsed=5m27.596s  eta=138h37m42.817s
INFO [03-08|09:43:02.288] Aborting state snapshot generation       root="2aea57…3775c0" in="002afd…0ebda1" at="ea9729…6b6760" accounts=61641 slots=64088 storage=7.55MiB    elapsed=5m28.113s  eta=138h50m50.417s
INFO [03-08|09:43:02.288] Resuming state snapshot generation       root="d7da90…ac2149" in="002afd…0ebda1" at="ea9729…6b6760" accounts=61641 slots=64088 storage=7.55MiB    elapsed=5m28.113s  eta=138h50m50.417s
INFO [03-08|09:43:02.656] Aborting state snapshot generation       root="d7da90…ac2149" in="002afd…0ebda1" at="edc04d…4a9612" accounts=61641 slots=64214 storage=7.56MiB    elapsed=5m28.481s  eta=139h0m11.03s
INFO [03-08|09:43:02.657] Resuming state snapshot generation       root="efa4de…0e3ffb" in="002afd…0ebda1" at="edc04d…4a9612" accounts=61641 slots=64214 storage=7.56MiB    elapsed=5m28.483s  eta=139h0m14.077s
INFO [03-08|09:43:03.410] Aborting state snapshot generation       root="efa4de…0e3ffb" in="002afd…0ebda1" at="f23fb4…c89595" accounts=61641 slots=64430 storage=7.57MiB    elapsed=5m29.235s  eta=139h19m19.678s
INFO [03-08|09:43:03.410] Resuming state snapshot generation       root="1e8ed7…020767" in="002afd…0ebda1" at="f23fb4…c89595" accounts=61641 slots=64430 storage=7.57MiB    elapsed=5m29.236s  eta=139h19m21.201s
INFO [03-08|09:43:03.909] Aborting state snapshot generation       root="1e8ed7…020767" in="002afd…0ebda1" at="f67793…7079a1" accounts=61641 slots=64609 storage=7.58MiB    elapsed=5m29.734s  eta=139h31m59.857s
INFO [03-08|09:43:03.909] Resuming state snapshot generation       root="568ed7…dd7de5" in="002afd…0ebda1" at="f67793…7079a1" accounts=61641 slots=64609 storage=7.58MiB    elapsed=5m29.734s  eta=139h31m59.857s
INFO [03-08|09:43:04.014] Aborting state snapshot generation       root="568ed7…dd7de5" in="002afd…0ebda1" at="f76be1…ae32bb" accounts=61641 slots=64648 storage=7.59MiB    elapsed=5m29.839s  eta=139h34m39.814s
INFO [03-08|09:43:04.014] Resuming state snapshot generation       root="0f6239…aaa757" in="002afd…0ebda1" at="f76be1…ae32bb" accounts=61641 slots=64648 storage=7.59MiB    elapsed=5m29.839s  eta=139h34m39.814s
INFO [03-08|09:43:04.398] Aborting state snapshot generation       root="0f6239…aaa757" in="002afd…0ebda1" at="fa9f4d…c96399" accounts=61641 slots=64787 storage=7.60MiB    elapsed=5m30.223s  eta=139h44m24.801s
INFO [03-08|09:43:04.398] Resuming state snapshot generation       root="4ac96c…998ee5" in="002afd…0ebda1" at="fa9f4d…c96399" accounts=61641 slots=64787 storage=7.60MiB    elapsed=5m30.223s  eta=139h44m24.801s
INFO [03-08|09:43:04.907] Aborting state snapshot generation       root="4ac96c…998ee5" in="002afd…0ebda1" at="fef336…12e458" accounts=61641 slots=64963 storage=7.61MiB    elapsed=5m30.733s  eta=139h57m21.738s
INFO [03-08|09:43:04.908] Resuming state snapshot generation       root="d6cbea…2634b1" in="002afd…0ebda1" at="fef336…12e458" accounts=61641 slots=64963 storage=7.61MiB    elapsed=5m30.733s  eta=139h57m21.738s

Question: Is this expected behavior?

@MysticRyuujin
Contributor Author

MysticRyuujin commented Mar 8, 2021

On discord:

YorickDowne (he/him):
That is normal behavior. It's stopping and starting snapshot gen so it runs "in between blocks".
Could be as simple as making it log "Pausing state snapshot generation during block import" instead of "Aborting state snapshot generation"

@holiman
Contributor

holiman commented Mar 8, 2021

The Wiper running is exactly what it sounds like: it's a clean-up task that wipes out earlier snapshot remnants. This can happen if, for instance, you start with snapshots enabled, then exit, import a block without snapshots enabled, exit, and start again with snapshots enabled. The snapshot base image then goes out of sync with the trie, so the complete snapshot is garbage and needs to be wiped.

After it was done with that, it started regenerating the snapshot. It does so in the periods between blocks, which usually means it can crunch snapshot data for ~14 seconds until the next block comes along. In the case of a full sync, that seems to be a bit counter-productive.

INFO [03-08|09:43:00.551] accounts=61641 slots=63464
...
INFO [03-08|09:43:04.908] accounts=61641 slots=64963

It managed to go through an additional ~1500 storage slots, but the starts and stops don't help.
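
For illustration, here is a rough sketch of the pause/resume pattern described above. This is not the actual core/state/snapshot code; the generator, run and pause names are made up, and the sleeps stand in for real work:

```go
package main

import (
	"fmt"
	"time"
)

// generator is a toy background snapshot generator that can be paused.
type generator struct {
	abort chan chan struct{} // send a reply channel, wait for the ack
}

// run crunches "snapshot entries" until it is told to stop.
func (g *generator) run() {
	for {
		select {
		case done := <-g.abort:
			fmt.Println("Aborting state snapshot generation")
			close(done) // acknowledge the pause
			return
		default:
			time.Sleep(10 * time.Millisecond) // stand-in for generating one batch
		}
	}
}

// pause stops the generator and blocks until it has acknowledged.
func (g *generator) pause() {
	done := make(chan struct{})
	g.abort <- done
	<-done
}

func main() {
	for block := 1; block <= 3; block++ {
		g := &generator{abort: make(chan chan struct{})}
		fmt.Println("Resuming state snapshot generation")
		go g.run()

		time.Sleep(50 * time.Millisecond) // stand-in for waiting on the next block
		g.pause()                         // generation must not run while the state is updated
		fmt.Printf("Imported new chain segment number=%d\n", block)
	}
}
```

With ~14-second mainnet block times the generator gets a useful slice of work per cycle; during a full sync of old blocks the pauses come almost back to back, which is why the ETA in the logs above barely moves.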

@ryanschneider
Contributor

@holiman we are seeing this happen on some nodes after a restart even if the snapshot flag isn't toggled.

For example, this clean shutdown triggered it; notice the error around snapshot generation on shutdown:

Mar 10 09:58:58 ip-10-0-69-249 geth[57822]: INFO [03-10|09:58:58.035] Imported new chain segment               blocks=1  txs=2   mgas=2.566   elapsed=6.492ms   mgasps=395.204 number=9810917 hash=
Mar 10 09:59:07 ip-10-0-69-249 geth[57822]: INFO [03-10|09:59:07.744] Deep froze chain segment                 blocks=7  elapsed=3.326ms   number=9720917 hash="ebdc89…5dc267"
Mar 10 09:59:10 ip-10-0-69-249 geth[57822]: INFO [03-10|09:59:10.196] Imported new chain segment               blocks=1  txs=2   mgas=0.068   elapsed=5.397ms   mgasps=12.634  number=9810918 hash=
Mar 10 09:59:11 ip-10-0-69-249 geth[57822]: INFO [03-10|09:59:11.013] Writing clean trie cache to disk         path=/mnt/local/ethereum/geth/triecache threads=1
Mar 10 09:59:11 ip-10-0-69-249 geth[57822]: INFO [03-10|09:59:11.320] Got interrupt, shutting down...
...
Mar 10 09:59:13 ip-10-0-69-249 geth[57822]: INFO [03-10|09:59:13.416] Persisted the clean trie cache           path=/mnt/local/ethereum/geth/triecache elapsed=2.402s

Mar 10 09:59:13 ip-10-0-69-249 geth[57822]: ERROR[03-10|09:59:13.421] Failed to journal state snapshot         err="snapshot [0x1b16058a5c49432bf193be2e7534f6fcde585f45c84c3ffd14e03e7d85204e3f] missing"
^^^ this one ^^^

Mar 10 09:59:13 ip-10-0-69-249 geth[57822]: INFO [03-10|09:59:13.421] Writing clean trie cache to disk         path=/mnt/local/ethereum/geth/triecache threads=16
Mar 10 09:59:13 ip-10-0-69-249 geth[57822]: INFO [03-10|09:59:13.777] Persisted the clean trie cache           path=/mnt/local/ethereum/geth/triecache elapsed=355.595ms
Mar 10 09:59:13 ip-10-0-69-249 geth[57822]: INFO [03-10|09:59:13.777] Blockchain stopped

On the next startup the node then rebuilds the snapshot index:

Mar 10 10:15:38 ip-10-0-69-249 geth[96580]: WARN [03-10|10:15:38.505] Failed to load snapshot, regenerating    err="head doesn't match snapshot: have 0x692deb981212e9adddba641f7346938b89b3a514390833addd1397f69e340438, want 0x1b16058a5c49432bf193be2e7534f6fcde585f45c84c3ffd14e03e7d85204e3f"
Mar 10 10:15:38 ip-10-0-69-249 geth[96580]: INFO [03-10|10:15:38.505] Rebuilding state snapshot
Mar 10 10:15:38 ip-10-0-69-249 geth[96580]: INFO [03-10|10:15:38.505] Wiper running, state snapshotting paused accounts=0 slots=0 storage=0.00B elapsed="560.22µs"

I think there might be a race of sorts around node shutdown leading to unneeded regeneration.
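
To make the failure mode concrete, here is a hedged sketch of the check that produces the warning above: on startup the journaled snapshot root is compared against the head block's state root, and any mismatch throws the snapshot away and regenerates it. loadJournalRoot and rebuildSnapshot are hypothetical stand-ins, not real geth functions:

```go
package main

import (
	"fmt"

	"github.com/ethereum/go-ethereum/common"
	"github.com/ethereum/go-ethereum/log"
)

// loadJournalRoot and rebuildSnapshot are placeholders for whatever reads the
// persisted snapshot journal and kicks off regeneration.
func loadJournalRoot() (common.Hash, error)  { return common.Hash{}, nil }
func rebuildSnapshot(head common.Hash) error { return nil }

// openSnapshot mirrors the decision seen in the logs: if the journaled root
// does not match the current head state root, the snapshot is discarded and
// rebuilt from scratch (the Wiper / Rebuilding lines above).
func openSnapshot(headRoot common.Hash) error {
	journalRoot, err := loadJournalRoot()
	if err != nil {
		return fmt.Errorf("failed to load snapshot journal: %w", err)
	}
	if journalRoot != headRoot {
		log.Warn("Failed to load snapshot, regenerating",
			"have", journalRoot, "want", headRoot)
		return rebuildSnapshot(headRoot)
	}
	return nil
}

func main() {}
```

So if the shutdown path fails to journal the snapshot for the latest head (the "Failed to journal state snapshot" error above), a mismatch like this is guaranteed on the next start.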

@holiman holiman changed the title from "Interesting logs for 1.10.1 - Normal?" to "Failing to store most recent snapshot" Mar 11, 2021
@holiman
Contributor

holiman commented Mar 11, 2021

@ryanschneider do you have more logs about what happens during that clean shutdown?

@holiman
Contributor

holiman commented Mar 11, 2021

Block: https://ropsten.etherscan.io/block/9810918, state root 0x1b16058a5c49432bf193be2e7534f6fcde585f45c84c3ffd14e03e7d85204e3f

@ryanschneider
Contributor

Unfortunately the server in question has already been recycled, but I was able to recover some of the log from my terminal buffer, hopefully this is enough: https://gist.github.com/ryanschneider/62f06149a215ccf7ccc72a239fa7e42f

Seems like the server was operating fine before the shutdown. FWIW, this node was only being used to sync the chain and make backups of the LevelDB (we stop geth every 4 hours, create a backup, then start it up again), so it shouldn't have had any load except for the usual devp2p traffic; no RPC traffic was going to this node.

@holiman
Contributor

holiman commented Mar 12, 2021

This is interesting:
Failed to load snapshot, regenerating err="head doesn't match snapshot: have 0x692deb981212e9adddba641f7346938b89b3a514390833addd1397f69e340438, want 0x1b16058a5c49432bf193be2e7534f6fcde585f45c84c3ffd14e03e7d85204e3f"
I've been iterating through the state roots on Ropsten, and I can't find 0x692deb981212e9adddba641f7346938b89b3a514390833addd1397f69e340438. My next thought was that maybe the snapshotter somehow got stopped, and that state root was the genesis, but nope, not even that. So right now I'm wondering if it might be some reorged-off uncle root that wound up as the last root. Really odd.
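
For reference, a minimal sketch of that kind of scan using ethclient, assuming a reachable Ropsten RPC endpoint; the URL and block range are placeholders:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"math/big"

	"github.com/ethereum/go-ethereum/common"
	"github.com/ethereum/go-ethereum/ethclient"
)

func main() {
	// The state root reported in the "have" half of the warning.
	target := common.HexToHash("0x692deb981212e9adddba641f7346938b89b3a514390833addd1397f69e340438")

	client, err := ethclient.Dial("http://localhost:8545") // placeholder endpoint
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Walk a window of canonical headers around the block where the journal
	// should have been written and compare their state roots.
	for n := int64(9810000); n <= 9811000; n++ {
		header, err := client.HeaderByNumber(context.Background(), big.NewInt(n))
		if err != nil {
			log.Fatal(err)
		}
		if header.Root == target {
			fmt.Println("found matching state root at block", n)
			return
		}
	}
	fmt.Println("no canonical block in that range has this state root")
}
```

A scan like this only sees canonical headers, so a root that came from an uncle or a reorged-off block would never show up, which is what the next comment checks.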

@holiman
Contributor

holiman commented Mar 12, 2021

There are only two uncles in the chain around that time

uncle hash: 0x8d60e761118902146673077635e96c619c53721cd5eacd0516e381b4fa559679 height 9720929 (uncle height 9720928) 
uncle hash: 0x60a346e088ce09686b6529add395281c9a3ff79cce5375721f8f0ec1a820ce25 height 9720920 (uncle height 9720919)

One was minted at height 9720919 and the other at 9720928. It doesn't seem likely that either of them has anything to do with this.

@ryanschneider
Contributor

My next thought was that maybe the snapshotter somehow got stopped

What if, instead of the genesis root, it's the state root of some partial block? Like if the shutdown context triggered mid-import and a state transition was aborted but accidentally sent to the snapshotter instead of being discarded?

I haven't looked at the code enough to say whether this is possible; just throwing it out there as a hypothesis.

@ryanschneider
Copy link
Contributor

@holiman looking a bit closer at my log, isn't it a little odd that Writing clean trie cache to disk was logged twice during shutdown?

AFAICT that would only happen if Database.SaveCachePeriodically just happened to fire around the time of shutdown, and then Database.SaveCache was called in BlockChain.Stop. It looks like that must've been the case since the first Writing clean trie was logged before the interrupt signal.

I feel like there's potentially something a little racy going on between BlockChain.Stop, Database.SaveCachePeriodically, and the snapshot journal being written in StateDB.Commit, but I need to spend a bit more time in that code to pin down what feels off to me.
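
A toy illustration of the kind of overlap suspected here (this is not geth code, only a sketch of the shape of the race): a ticker-driven saver and the shutdown path can both decide to persist around the same time, which would explain the doubled "Writing clean trie cache to disk" lines:

```go
package main

import (
	"fmt"
	"time"
)

// saveCache stands in for the trie-cache flush; on a real shutdown the
// snapshot journal is written in the same window.
func saveCache(reason string) {
	fmt.Println("Writing clean trie cache to disk |", reason)
	time.Sleep(100 * time.Millisecond) // pretend the flush takes a while
	fmt.Println("Persisted the clean trie cache   |", reason)
}

// saveCachePeriodically mimics a ticker-driven saver whose stop channel is
// only checked between ticks, so a save already in flight keeps running.
func saveCachePeriodically(interval time.Duration, stop <-chan struct{}) {
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case <-t.C:
			saveCache("periodic")
		case <-stop:
			return
		}
	}
}

func main() {
	stop := make(chan struct{})
	go saveCachePeriodically(150*time.Millisecond, stop)

	// The interrupt arrives while a periodic save may already be running,
	// so both paths end up flushing close together.
	time.Sleep(200 * time.Millisecond)
	fmt.Println("Got interrupt, shutting down...")
	close(stop)
	saveCache("shutdown")
	time.Sleep(150 * time.Millisecond) // let the in-flight periodic save finish
}
```

Whether the real code actually has an unsynchronized window here is exactly the open question; the sketch just shows why two flushes in one shutdown log are plausible.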

@no-response

no-response bot commented Apr 10, 2021

This issue has been automatically closed because there has been no response to our request for more information from the original author. With only the information that is currently in the issue, we don't have enough information to take action. Please reach out if you have more relevant information or answers to our questions so that we can investigate further.

@no-response no-response bot closed this as completed Apr 10, 2021
@ryanschneider
Contributor

@holiman @ligi I think this was closed in error; are my logs enough to satisfy the needs-information tag?

@MariusVanDerWijden
Member

Sorry, the no-response bot is a bit hyperactive sometimes. It doesn't recognize "more information" from anyone other than the OP, so it closes some issues prematurely.

@holiman
Contributor

holiman commented Apr 10, 2021 via email

@xanoni

xanoni commented Aug 24, 2021

Any news here? I get the Failed to journal state snapshot error whenever I stop geth (clean restart via SIGINT).

See here (but it's the same with Fast Sync):
#23228 (comment)
