Failing to store most recent snapshot #22463

Closed
MysticRyuujin opened this issue Mar 8, 2021 · 16 comments

@MysticRyuujin
Contributor

I'm running an archive node:

/usr/bin/geth --datadir=/data/geth/datadir --syncmode=full --gcmode=archive  --cache=8192 --ipcdisable --maxpeers=100 --light.serve 400

I removed some irrelevant flags (e.g. http, ws, graphql, ethstats).

I asked over on the Discord whether I needed to be concerned about any of the new flags, and I was told no, not really, because I run with
--syncmode=full --gcmode=archive

This node is not yet fully in sync; it's back in the 10M block range.

The first time I started it up, I got a lot of logs like this:

INFO [03-08|09:27:33.196] Imported new chain segment               blocks=12  txs=1295 mgas=130.991 elapsed=8.039s     mgasps=16.294 number=10492292 hash="25dcf5…157edf" age=7mo3w18h dirty=0.00B
INFO [03-08|09:27:33.866] Wiper running, state snapshotting paused accounts=0 slots=0 storage=0.00B elapsed=19m3.465s
INFO [03-08|09:27:34.423] Wiper running, state snapshotting paused accounts=0 slots=0 storage=0.00B elapsed=19m4.023s
INFO [03-08|09:27:35.136] Wiper running, state snapshotting paused accounts=0 slots=0 storage=0.00B elapsed=19m4.735s
INFO [03-08|09:27:35.803] Wiper running, state snapshotting paused accounts=0 slots=0 storage=0.00B elapsed=19m5.402s
INFO [03-08|09:27:36.192] Wiper running, state snapshotting paused accounts=0 slots=0 storage=0.00B elapsed=19m5.791s
INFO [03-08|09:27:37.075] Wiper running, state snapshotting paused accounts=0 slots=0 storage=0.00B elapsed=19m6.674s
INFO [03-08|09:27:37.829] Wiper running, state snapshotting paused accounts=0 slots=0 storage=0.00B elapsed=19m7.429s
INFO [03-08|09:27:38.612] Wiper running, state snapshotting paused accounts=0 slots=0 storage=0.00B elapsed=19m8.212s
INFO [03-08|09:27:39.205] Wiper running, state snapshotting paused accounts=0 slots=0 storage=0.00B elapsed=19m8.804s
INFO [03-08|09:27:40.461] Wiper running, state snapshotting paused accounts=0 slots=0 storage=0.00B elapsed=19m10.061s

I then got this log entry:

INFO [03-08|09:37:34.174] Compacted snapshot area in database      elapsed=25m28.016s

Then, my node started spitting this out A LOT:

INFO [03-08|09:43:00.086] Imported new chain segment               blocks=16  txs=3430 mgas=190.630 elapsed=8.341s     mgasps=22.853 number=10493904 hash="bab0c5…d81b9b" age=7mo3w12h dirty=0.00B
INFO [03-08|09:43:00.551] Aborting state snapshot generation       root="5c7442…51d5e4" in="002afd…0ebda1" at="db3f6a…66e17a" accounts=61641 slots=63464 storage=7.50MiB    elapsed=5m26.376s  eta=138h6m44.263s
INFO [03-08|09:43:00.551] Resuming state snapshot generation       root="64dcef…c2148f" in="002afd…0ebda1" at="db3f6a…66e17a" accounts=61641 slots=63464 storage=7.50MiB    elapsed=5m26.377s  eta=138h6m45.787s
INFO [03-08|09:43:01.021] Aborting state snapshot generation       root="64dcef…c2148f" in="002afd…0ebda1" at="dfa37a…796135" accounts=61641 slots=63631 storage=7.51MiB    elapsed=5m26.847s  eta=138h18m41.787s
INFO [03-08|09:43:01.022] Resuming state snapshot generation       root="20aec8…41eb85" in="002afd…0ebda1" at="dfa37a…796135" accounts=61641 slots=63631 storage=7.51MiB    elapsed=5m26.847s  eta=138h18m41.787s
INFO [03-08|09:43:01.056] Aborting state snapshot generation       root="20aec8…41eb85" in="002afd…0ebda1" at="dff358…62eed7" accounts=61641 slots=63645 storage=7.51MiB    elapsed=5m26.881s  eta=138h19m33.583s
INFO [03-08|09:43:01.057] Resuming state snapshot generation       root="8ace18…c5dc63" in="002afd…0ebda1" at="dff358…62eed7" accounts=61641 slots=63645 storage=7.51MiB    elapsed=5m26.882s  eta=138h19m35.106s
INFO [03-08|09:43:01.771] Aborting state snapshot generation       root="8ace18…c5dc63" in="002afd…0ebda1" at="e6283f…9a2eb5" accounts=61641 slots=63897 storage=7.53MiB    elapsed=5m27.596s  eta=138h37m42.817s
INFO [03-08|09:43:01.771] Resuming state snapshot generation       root="2aea57…3775c0" in="002afd…0ebda1" at="e6283f…9a2eb5" accounts=61641 slots=63897 storage=7.53MiB    elapsed=5m27.596s  eta=138h37m42.817s
INFO [03-08|09:43:02.288] Aborting state snapshot generation       root="2aea57…3775c0" in="002afd…0ebda1" at="ea9729…6b6760" accounts=61641 slots=64088 storage=7.55MiB    elapsed=5m28.113s  eta=138h50m50.417s
INFO [03-08|09:43:02.288] Resuming state snapshot generation       root="d7da90…ac2149" in="002afd…0ebda1" at="ea9729…6b6760" accounts=61641 slots=64088 storage=7.55MiB    elapsed=5m28.113s  eta=138h50m50.417s
INFO [03-08|09:43:02.656] Aborting state snapshot generation       root="d7da90…ac2149" in="002afd…0ebda1" at="edc04d…4a9612" accounts=61641 slots=64214 storage=7.56MiB    elapsed=5m28.481s  eta=139h0m11.03s
INFO [03-08|09:43:02.657] Resuming state snapshot generation       root="efa4de…0e3ffb" in="002afd…0ebda1" at="edc04d…4a9612" accounts=61641 slots=64214 storage=7.56MiB    elapsed=5m28.483s  eta=139h0m14.077s
INFO [03-08|09:43:03.410] Aborting state snapshot generation       root="efa4de…0e3ffb" in="002afd…0ebda1" at="f23fb4…c89595" accounts=61641 slots=64430 storage=7.57MiB    elapsed=5m29.235s  eta=139h19m19.678s
INFO [03-08|09:43:03.410] Resuming state snapshot generation       root="1e8ed7…020767" in="002afd…0ebda1" at="f23fb4…c89595" accounts=61641 slots=64430 storage=7.57MiB    elapsed=5m29.236s  eta=139h19m21.201s
INFO [03-08|09:43:03.909] Aborting state snapshot generation       root="1e8ed7…020767" in="002afd…0ebda1" at="f67793…7079a1" accounts=61641 slots=64609 storage=7.58MiB    elapsed=5m29.734s  eta=139h31m59.857s
INFO [03-08|09:43:03.909] Resuming state snapshot generation       root="568ed7…dd7de5" in="002afd…0ebda1" at="f67793…7079a1" accounts=61641 slots=64609 storage=7.58MiB    elapsed=5m29.734s  eta=139h31m59.857s
INFO [03-08|09:43:04.014] Aborting state snapshot generation       root="568ed7…dd7de5" in="002afd…0ebda1" at="f76be1…ae32bb" accounts=61641 slots=64648 storage=7.59MiB    elapsed=5m29.839s  eta=139h34m39.814s
INFO [03-08|09:43:04.014] Resuming state snapshot generation       root="0f6239…aaa757" in="002afd…0ebda1" at="f76be1…ae32bb" accounts=61641 slots=64648 storage=7.59MiB    elapsed=5m29.839s  eta=139h34m39.814s
INFO [03-08|09:43:04.398] Aborting state snapshot generation       root="0f6239…aaa757" in="002afd…0ebda1" at="fa9f4d…c96399" accounts=61641 slots=64787 storage=7.60MiB    elapsed=5m30.223s  eta=139h44m24.801s
INFO [03-08|09:43:04.398] Resuming state snapshot generation       root="4ac96c…998ee5" in="002afd…0ebda1" at="fa9f4d…c96399" accounts=61641 slots=64787 storage=7.60MiB    elapsed=5m30.223s  eta=139h44m24.801s
INFO [03-08|09:43:04.907] Aborting state snapshot generation       root="4ac96c…998ee5" in="002afd…0ebda1" at="fef336…12e458" accounts=61641 slots=64963 storage=7.61MiB    elapsed=5m30.733s  eta=139h57m21.738s
INFO [03-08|09:43:04.908] Resuming state snapshot generation       root="d6cbea…2634b1" in="002afd…0ebda1" at="fef336…12e458" accounts=61641 slots=64963 storage=7.61MiB    elapsed=5m30.733s  eta=139h57m21.738s

Question: Is this expected behavior?

@MysticRyuujin
Contributor Author

MysticRyuujin commented Mar 8, 2021

On discord:

YorickDowne (he/him):
That is normal behavior. It's stopping and starting snapshot gen so it runs "in between blocks".
Could be as simple as making it log "Pausing state snapshot generation during block import" instead of "Aborting state snapshot generation"

@holiman
Contributor

holiman commented Mar 8, 2021

The Wiper running is exactly what it sounds like: it's a clean-up task that wipes out earlier snapshot remnants. This can happen if, for instance, you start with snapshots enabled, then exit, import a block without snapshots enabled, exit, and start again with snapshots enabled. The snapshot base image then goes out of sync with the trie, so the complete snapshot is garbage and needs to be wiped.

After it was done with that, it started regenerating the snapshot. It does so in the periods between blocks, which usually means it can crunch snapshot data for ~14 seconds until the next block comes along. In the case of a full sync, that seems to be a bit counter-productive.

INFO [03-08|09:43:00.551] accounts=61641 slots=63464
...
INFO [03-08|09:43:04.908] accounts=61641 slots=64963

It managed to go through an additional ~1500 storage slots, but the starts and stops don't help.
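
For illustration, here is a rough sketch of the pause/resume pattern described above. This is not the actual core/state/snapshot code; the generator, run and pause names are made up, and the sleeps stand in for real work:

```go
package main

import (
	"fmt"
	"time"
)

// generator is a toy background snapshot generator that can be paused.
type generator struct {
	abort chan chan struct{} // send a reply channel, wait for the ack
}

// run crunches "snapshot entries" until it is told to stop.
func (g *generator) run() {
	for {
		select {
		case done := <-g.abort:
			fmt.Println("Aborting state snapshot generation")
			close(done) // acknowledge the pause
			return
		default:
			time.Sleep(10 * time.Millisecond) // stand-in for generating one batch
		}
	}
}

// pause stops the generator and blocks until it has acknowledged.
func (g *generator) pause() {
	done := make(chan struct{})
	g.abort <- done
	<-done
}

func main() {
	for block := 1; block <= 3; block++ {
		g := &generator{abort: make(chan chan struct{})}
		fmt.Println("Resuming state snapshot generation")
		go g.run()

		time.Sleep(50 * time.Millisecond) // stand-in for waiting on the next block
		g.pause()                         // generation must not run while the state is updated
		fmt.Printf("Imported new chain segment number=%d\n", block)
	}
}
```

With ~14-second mainnet block times the generator gets a useful slice of work per cycle; during a full sync of old blocks the pauses come almost back to back, which is why the ETA in the logs above barely moves.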

@ryanschneider
Contributor

@holiman we are seeing this happen on some nodes after a restart even if the snapshot flag isn't toggled.

For example, this clean shutdown triggered it; notice the error around snapshot generation on shutdown:

Mar 10 09:58:58 ip-10-0-69-249 geth[57822]: INFO [03-10|09:58:58.035] Imported new chain segment               blocks=1  txs=2   mgas=2.566   elapsed=6.492ms   mgasps=395.204 number=9810917 hash=
Mar 10 09:59:07 ip-10-0-69-249 geth[57822]: INFO [03-10|09:59:07.744] Deep froze chain segment                 blocks=7  elapsed=3.326ms   number=9720917 hash="ebdc89…5dc267"
Mar 10 09:59:10 ip-10-0-69-249 geth[57822]: INFO [03-10|09:59:10.196] Imported new chain segment               blocks=1  txs=2   mgas=0.068   elapsed=5.397ms   mgasps=12.634  number=9810918 hash=
Mar 10 09:59:11 ip-10-0-69-249 geth[57822]: INFO [03-10|09:59:11.013] Writing clean trie cache to disk         path=/mnt/local/ethereum/geth/triecache threads=1
Mar 10 09:59:11 ip-10-0-69-249 geth[57822]: INFO [03-10|09:59:11.320] Got interrupt, shutting down...
...
Mar 10 09:59:13 ip-10-0-69-249 geth[57822]: INFO [03-10|09:59:13.416] Persisted the clean trie cache           path=/mnt/local/ethereum/geth/triecache elapsed=2.402s

Mar 10 09:59:13 ip-10-0-69-249 geth[57822]: ERROR[03-10|09:59:13.421] Failed to journal state snapshot         err="snapshot [0x1b16058a5c49432bf193be2e7534f6fcde585f45c84c3ffd14e03e7d85204e3f] missing"
^^^ this one ^^^

Mar 10 09:59:13 ip-10-0-69-249 geth[57822]: INFO [03-10|09:59:13.421] Writing clean trie cache to disk         path=/mnt/local/ethereum/geth/triecache threads=16
Mar 10 09:59:13 ip-10-0-69-249 geth[57822]: INFO [03-10|09:59:13.777] Persisted the clean trie cache           path=/mnt/local/ethereum/geth/triecache elapsed=355.595ms
Mar 10 09:59:13 ip-10-0-69-249 geth[57822]: INFO [03-10|09:59:13.777] Blockchain stopped

On the next startup the node then rebuilds the snapshot index:

Mar 10 10:15:38 ip-10-0-69-249 geth[96580]: WARN [03-10|10:15:38.505] Failed to load snapshot, regenerating    err="head doesn't match snapshot: have 0x692deb981212e9adddba641f7346938b89b3a514390833addd1397f69e340438, want 0x1b16058a5c49432bf193be2e7534f6fcde585f45c84c3ffd14e03e7d85204e3f"
Mar 10 10:15:38 ip-10-0-69-249 geth[96580]: INFO [03-10|10:15:38.505] Rebuilding state snapshot
Mar 10 10:15:38 ip-10-0-69-249 geth[96580]: INFO [03-10|10:15:38.505] Wiper running, state snapshotting paused accounts=0 slots=0 storage=0.00B elapsed="560.22µs"

I think there might be a race of sorts around node shutdown leading to unneeded regeneration.
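
To make the failure mode concrete, here is a hedged sketch of the check that produces the warning above: on startup the journaled snapshot root is compared against the head block's state root, and any mismatch throws the snapshot away and regenerates it. loadJournalRoot and rebuildSnapshot are hypothetical stand-ins, not real geth functions:

```go
package main

import (
	"fmt"

	"github.com/ethereum/go-ethereum/common"
	"github.com/ethereum/go-ethereum/log"
)

// loadJournalRoot and rebuildSnapshot are placeholders for whatever reads the
// persisted snapshot journal and kicks off regeneration.
func loadJournalRoot() (common.Hash, error)  { return common.Hash{}, nil }
func rebuildSnapshot(head common.Hash) error { return nil }

// openSnapshot mirrors the decision seen in the logs: if the journaled root
// does not match the current head state root, the snapshot is discarded and
// rebuilt from scratch (the Wiper / Rebuilding lines above).
func openSnapshot(headRoot common.Hash) error {
	journalRoot, err := loadJournalRoot()
	if err != nil {
		return fmt.Errorf("failed to load snapshot journal: %w", err)
	}
	if journalRoot != headRoot {
		log.Warn("Failed to load snapshot, regenerating",
			"have", journalRoot, "want", headRoot)
		return rebuildSnapshot(headRoot)
	}
	return nil
}

func main() {}
```

So if the shutdown path fails to journal the snapshot for the latest head (the "Failed to journal state snapshot" error above), a mismatch like this is guaranteed on the next start.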

@holiman holiman changed the title from "Interesting logs for 1.10.1 - Normal?" to "Failing to store most recent snapshot" Mar 11, 2021
@holiman
Contributor

holiman commented Mar 11, 2021

@ryanschneider do you have more logs about what happens during that clean shutdown?

@holiman
Contributor

holiman commented Mar 11, 2021

Block: https://ropsten.etherscan.io/block/9810918, state root 0x1b16058a5c49432bf193be2e7534f6fcde585f45c84c3ffd14e03e7d85204e3f

@ryanschneider
Contributor

Unfortunately the server in question has already been recycled, but I was able to recover some of the log from my terminal buffer, hopefully this is enough: https://gist.github.com/ryanschneider/62f06149a215ccf7ccc72a239fa7e42f

Seems like the server was operating fine before the shutdown. FWIW, this node was only being used to sync the chain and make backups of the LevelDB (we stop geth every 4 hours, create a backup, then start it up again), so it shouldn't have had any load except for the usual devp2p traffic; no RPC traffic was going to this node.

@holiman
Contributor

holiman commented Mar 12, 2021

This is interesting:
Failed to load snapshot, regenerating err="head doesn't match snapshot: have 0x692deb981212e9adddba641f7346938b89b3a514390833addd1397f69e340438, want 0x1b16058a5c49432bf193be2e7534f6fcde585f45c84c3ffd14e03e7d85204e3f"
I've been iterating through the state roots on Ropsten, and I can't find 0x692deb981212e9adddba641f7346938b89b3a514390833addd1397f69e340438. My next thought was that maybe the snapshotter somehow got stopped, and that state root was the genesis, but nope, not even that. So right now I'm wondering if it might be some reorged-off uncle root that wound up as the last root. Really odd.
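
For reference, a minimal sketch of that kind of scan using ethclient, assuming a reachable Ropsten RPC endpoint; the URL and block range are placeholders:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"math/big"

	"github.com/ethereum/go-ethereum/common"
	"github.com/ethereum/go-ethereum/ethclient"
)

func main() {
	// The state root reported in the "have" half of the warning.
	target := common.HexToHash("0x692deb981212e9adddba641f7346938b89b3a514390833addd1397f69e340438")

	client, err := ethclient.Dial("http://localhost:8545") // placeholder endpoint
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Walk a window of canonical headers around the block where the journal
	// should have been written and compare their state roots.
	for n := int64(9810000); n <= 9811000; n++ {
		header, err := client.HeaderByNumber(context.Background(), big.NewInt(n))
		if err != nil {
			log.Fatal(err)
		}
		if header.Root == target {
			fmt.Println("found matching state root at block", n)
			return
		}
	}
	fmt.Println("no canonical block in that range has this state root")
}
```

A scan like this only sees canonical headers, so a root that came from an uncle or a reorged-off block would never show up, which is what the next comment checks.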

@holiman
Contributor

holiman commented Mar 12, 2021

There are only two uncles in the chain around that time

uncle hash: 0x8d60e761118902146673077635e96c619c53721cd5eacd0516e381b4fa559679 height 9720929 (uncle height 9720928) 
uncle hash: 0x60a346e088ce09686b6529add395281c9a3ff79cce5375721f8f0ec1a820ce25 height 9720920 (uncle height 9720919)

One was minted at height 9720919 and the other at 9720928. It doesn't seem likely that either of them has anything to do with this.

@ryanschneider
Contributor

My next thought was that maybe the snapshotter somehow got stopped

What if, instead of the genesis root, it's the state root of some partial block? Like if the shutdown context triggered mid-import and a state transition was aborted but accidentally sent to the snapshotter instead of being discarded?

I haven't looked at the code enough to say whether this is possible; just throwing it out there as a hypothesis.

@ryanschneider
Copy link
Contributor

@holiman looking a bit closer at my log, isn't it a little odd that Writing clean trie cache to disk was logged twice during shutdown?

AFAICT that would only happen if Database.SaveCachePeriodically just happened to fire around the time of shutdown, and then Database.SaveCache was called in BlockChain.Stop. It looks like that must've been the case since the first Writing clean trie was logged before the interrupt signal.

I feel like there's potentially something a little racy going on between BlockChain.Stop, Database.SaveCachePeriodically, and the snapshot journal being written in StateDB.Commit, but I need to spend a bit more time in that code to pin down what feels off to me.
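
A toy illustration of the kind of overlap suspected here (this is not geth code, only a sketch of the shape of the race): a ticker-driven saver and the shutdown path can both decide to persist around the same time, which would explain the doubled "Writing clean trie cache to disk" lines:

```go
package main

import (
	"fmt"
	"time"
)

// saveCache stands in for the trie-cache flush; on a real shutdown the
// snapshot journal is written in the same window.
func saveCache(reason string) {
	fmt.Println("Writing clean trie cache to disk |", reason)
	time.Sleep(100 * time.Millisecond) // pretend the flush takes a while
	fmt.Println("Persisted the clean trie cache   |", reason)
}

// saveCachePeriodically mimics a ticker-driven saver whose stop channel is
// only checked between ticks, so a save already in flight keeps running.
func saveCachePeriodically(interval time.Duration, stop <-chan struct{}) {
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case <-t.C:
			saveCache("periodic")
		case <-stop:
			return
		}
	}
}

func main() {
	stop := make(chan struct{})
	go saveCachePeriodically(150*time.Millisecond, stop)

	// The interrupt arrives while a periodic save may already be running,
	// so both paths end up flushing close together.
	time.Sleep(200 * time.Millisecond)
	fmt.Println("Got interrupt, shutting down...")
	close(stop)
	saveCache("shutdown")
	time.Sleep(150 * time.Millisecond) // let the in-flight periodic save finish
}
```

Whether the real code actually has an unsynchronized window here is exactly the open question; the sketch just shows why two flushes in one shutdown log are plausible.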

@no-response

no-response bot commented Apr 10, 2021

This issue has been automatically closed because there has been no response to our request for more information from the original author. With only the information that is currently in the issue, we don't have enough information to take action. Please reach out if you have more relevant information or answers to our questions so that we can investigate further.

@no-response no-response bot closed this as completed Apr 10, 2021
@ryanschneider
Contributor

@holiman @ligi I think this was closed in error; are my logs enough to satisfy the needs-information tag?

@MariusVanDerWijden
Member

Sorry, the no-response bot is a bit hyperactive sometimes. It doesn't recognize "more information" from anyone other than the OP, so it closes some issues prematurely.

@holiman
Contributor

holiman commented Apr 10, 2021 via email

@xanoni

xanoni commented Aug 24, 2021

Any news here? I get the Failed to journal state snapshot error whenever I stop geth (clean restart via SIGINT).

See here (but it's the same with Fast Sync):
#23228 (comment)
